xpk: add team-based PoC quota routing with ConfigMap discovery #1180
ultrons wants to merge 19 commits into
Conversation
Adds three flags to `xpk workload create`:

- `--team`: routes the job to `<namespace>/<lq>/<priorityClass>`
- `--value-class`: job-class label for audit and priority ordering
- `--declared-duration-minutes`: honest p90 estimate, used by a time-limit controller
When --team is set:
- Pod template is placed in the team's namespace (poc-<team>) and labeled with
the team's LocalQueue and PriorityClass.
- declared-duration-minutes is propagated into the pod template's metadata
labels (Kueue does not copy arbitrary JobSet metadata to the Workload, but
it does copy pod-template metadata into spec.podSets[*].template).
When --team is unset, behavior is identical to upstream (default namespace,
multislice-queue).
Team routing initially uses a hardcoded POC_TEAM_CONFIG dict in
core/kueue_manager.py; a follow-up commit replaces this with discovery from a
cluster-side ConfigMap.
Adds `xpk workload status --team=<t> --workload=<name>` (or omit --workload to list all). With one diagnostic, the command tells the user whether their workload is:

- QUEUED (and what's ahead of it)
- STUCK (quota reserved but admission failing)
- RUNNING
- FINISHED

For STUCK workloads the command parses recent Warning events and surfaces the most common causes (e.g. a slice name exceeding the super-slice 49-char limit), with a copy-pasteable shorter-name suggestion. Workloads are looked up by the xpk.google.com/workload label, so the user can pass their full display name regardless of any internal name shortening. Cluster credentials are fetched at runtime; the user no longer needs compute/zone in their gcloud config for this command to work.
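The four-state diagnostic above can be sketched as a small classifier over the Kueue Workload's status conditions. This is an illustrative sketch, not xpk's actual implementation: the condition names (QuotaReserved, Admitted, Finished) follow Kueue's Workload status conventions, and `classify` / `_cond_true` are hypothetical helper names.

```python
def _cond_true(workload: dict, ctype: str) -> bool:
  """Return True if the named condition exists and its status is 'True'."""
  for cond in workload.get('status', {}).get('conditions', []):
    if cond.get('type') == ctype and cond.get('status') == 'True':
      return True
  return False


def classify(workload: dict) -> str:
  """Map a Kueue Workload object to one of the four user-facing states."""
  if _cond_true(workload, 'Finished'):
    return 'FINISHED'
  if _cond_true(workload, 'Admitted'):
    return 'RUNNING'
  if _cond_true(workload, 'QuotaReserved'):
    # Quota is held but the pods never became schedulable:
    # admission is failing somewhere downstream of Kueue.
    return 'STUCK'
  return 'QUEUED'
```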
Three related changes that move team configuration from hardcoded dicts to
runtime discovery, eliminating xpk releases as the gating step for cluster
admin changes:
1. Decouple xpk --workload from the K8s JobSet name. xpk now derives a short,
deterministic K8s name (`{ldap_prefix}-{hex4}`) that fits the super-slice
admission controller's 49-char limit, while keeping the user-facing display
name on the JobSet via the xpk.google.com/workload label. Users pass the
same display name to `workload status / delete / list` regardless of length.
2. Replace POC_TEAM_CONFIG and POC_TEAM_MAX_WORKLOAD_NAME dicts (and the two
argparse choices=[...] lists) with discovery from a kueue-system/poc-team-config
ConfigMap. New module core/poc_discovery.py: fetch_poc_config(),
resolve_team(), max_k8s_workload_name_len(), available_teams(),
available_value_classes(). When --team is unset there is no cluster call —
upstream behavior is preserved bit-for-bit. When --team is set with a value
not on the cluster, xpk now prints the live available list as the error.
3. Add an opportunistic local cache under ~/.xpk/poc-cache/<context>.json so
neither tab completion (argcomplete completers on --cluster, --team,
--value-class) nor did-you-mean suggestions (`Did you mean: ml-perf?`) need
to call the cluster on every keystroke. The cache is refreshed any time a
live ConfigMap fetch succeeds; if it is missing or stale the worst case is a
slightly less useful error message — never a wrong-routing decision.
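The short-name derivation in (1) can be sketched as follows. This is a hedged illustration of the `{ldap_prefix}-{hex4}` shape only; the real budget math lives in core/kueue_manager.py and quota_discovery.max_k8s_workload_name_len, and `derive_short_name` with its default max_len is hypothetical.

```python
import hashlib


def derive_short_name(display_name: str, max_len: int = 12) -> str:
  """Deterministic short K8s name: {ldap_prefix}-{hex4} (illustrative)."""
  # Reserve 5 chars for '-' plus 4 hex digits of a stable hash.
  prefix = display_name.split('-')[0][: max_len - 5]
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  return f'{prefix}-{hex4}'
```

Because the suffix is a hash of the full display name, the mapping is stable across invocations, so `workload status / delete / list` can resolve the same display name to the same JobSet every time.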
When a PoC team is in use, args.workload is the user-friendly display name (e.g. mlperf-conv-13) and the actual K8s JobSet name is a derived short name (e.g. mlperf-d6cc). The "Follow your workload here:" and "Follow your worker 0, slice 0 logs here:" URLs were hardcoding namespace="default" and using args.workload as the pod prefix, so they pointed nowhere: the workload lives in the team namespace and its pods are named after the JobSet, not args.workload.

This commit:
- Initializes poc_namespace/k8s_name early in workload_create so the PoC values are available where the URLs are emitted.
- Substitutes those values into both the dashboard URL and the Cloud Logging filter (namespace_name and pod_name prefix).
- Falls back to the upstream behavior (namespace=default, prefix=args.workload) when no PoC team is set, so non-PoC users are unaffected.
jamOne-
left a comment
Thanks for the PR!
- Please also resolve the failing presubmits.
- Please add unit tests to your changes.
    _SUPER_SLICING_WORKLOAD_NAME_LIMIT = 28
    ...
    def _load_poc_cfg(args) -> dict | None:
What is PoC and why do we name our code after it?
Fair point — "PoC" was an internal project name and shouldn't leak into upstream code. Renamed throughout: poc_discovery.py → quota_discovery.py, _resolve_poc_team → _resolve_quota_team, _load_poc_cfg → _load_quota_cfg, _poc_cfg → _quota_cfg, _poc_labels → _team_labels, etc. All user-facing strings (parser help text, error messages, log lines) now use neutral wording like "team-quota" / "team-routing". The cluster-side ConfigMap name poc-team-config is kept as-is for now since renaming it is a chart-side change that needs separate coordination — happy to do that as a follow-up.
Nice, thanks!
The poc-team-config name, let's change that as well please. We can make the name configurable in the config like: xpk config set team-configmap-name poc-team-config if you do not want to update your cluster.
You can read the name like here:
Line 39 in 20e3da5
And prepare the config key like here:
Lines 59 to 67 in 20e3da5
Done in 3a9ac68. Default is now team-quota-config (no poc- literal anywhere). Operators with the legacy ConfigMap can override without touching the cluster: `xpk config set team-configmap-name poc-team-config`. Wired exactly per your references: added TEAM_CONFIGMAP_NAME_KEY to core/config.py's DEFAULT_KEYS, and quota_discovery._configmap_name() reads it via get_config().get(...) with the default as fallback. +2 tests covering both the default and the override path.
    # For PoC teams, inject the user's full display name as XPK_WORKLOAD_NAME so
    # training scripts can reference it for GCS artifact paths regardless of the
    # short K8s name xpk derives for the JobSet.
    if getattr(args, 'team', None):
      args.env.setdefault('XPK_WORKLOAD_NAME', args.workload)
This looks like a hack. Do we need this name shortening?
The shortening exists because the super-slice admission controller has a 49-char hard limit on the slice name it creates from each JobSet (format: -jobset---slice-job-). Subtracting the fixed framing (~26 chars) and the namespace length leaves a small budget for the user-facing workload name — e.g. for poc-ml-perf (11 chars), only 12 chars are available. User-friendly names like mlperf-conv-13 (14 chars) blow past this, the JobSet/Slice fails admission, and the user sees a confusing low-level error. Without shortening, this is a real foot-gun.
Updated this commit to:
- Add an inline comment explaining the constraint and pointing at quota_discovery.max_k8s_workload_name_len for the budget math.
- Add --no-shorten-jobset-name so users can opt out when their workload name is short enough already.
- Only print the "workload → JobSet name" mapping when shortening actually triggered (no-op cases stay silent).
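The budget arithmetic above, as a worked example. The 49-char limit and ~26-char framing are the figures quoted in this thread; the constant names and helper are otherwise illustrative.

```python
SLICE_NAME_LIMIT = 49   # hard limit enforced by the super-slice admission controller
FIXED_FRAMING = 26      # approximate fixed '-jobset-...-slice-job-' overhead


def workload_name_budget(namespace: str) -> int:
  """Chars left for the user-facing workload name after framing + namespace."""
  return SLICE_NAME_LIMIT - FIXED_FRAMING - len(namespace)
```

For the poc-ml-perf namespace (11 chars) this leaves 12 chars, so a 14-char name like mlperf-conv-13 fails admission without shortening.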
@FIoannides I need your review on that one. IMO this decision can be breaking for some users.
    )

    poc_namespace, poc_local_queue, poc_priority = _resolve_poc_team(args)
    # Derive a short K8s-safe JobSet name for PoC teams; use display name as-is otherwise.
But why? This can be unexpected from a user's perspective. We also have other commands that take workload as an argument, e.g. xpk inspector.
So is it actually required for the --team changes?
Yes — same root cause as the previous comment. With --team, the workload routes to a per-team namespace, which both forces the JobSet to be created in that namespace AND eats into the slice-name char budget. Both behaviors only kick in when --team is set; the upstream non-team path is unchanged. Added a docstring + inline comment explaining the conditional, plus the --no-shorten-jobset-name opt-out flag for users who'd rather see the create fail loudly than have their name silently shortened.
Renamings:
- core/poc_discovery.py -> core/quota_discovery.py
- _resolve_poc_team -> _resolve_quota_team, _load_poc_cfg -> _load_quota_cfg,
fetch_poc_config -> fetch_quota_config
- args._poc_cfg -> args._quota_cfg
- _build_poc_labels -> _build_team_labels (and pod-template variant)
- {poc_labels} / {poc_pod_template_labels} YAML keys -> {team_labels} /
{team_pod_template_labels}
- Local cache dir: ~/.xpk/poc-cache -> ~/.xpk/quota-cache
- All "PoC" references in user-facing strings, docstrings and parser help
text replaced with neutral team-quota wording.
Structure:
- Replace the ambiguous (str, str, str) tuple return of resolve_team() with
  a @dataclass TeamRouting{namespace, local_queue, priority_class}.
- Move workload_status() and its helpers from commands/workload.py to a
dedicated commands/workload_status.py.
- In core/kueue_manager.py, move derive_k8s_workload_name() below the
module-level constants block so the constants stay together at the top.
- Tightened "team is set" check to a real non-empty-string test (still
passes existing MagicMock-based tests).
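The TeamRouting dataclass described above can be sketched as below (field names come from the commit message; the frozen flag and the example values are assumptions):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamRouting:
  """Named routing target, replacing the ambiguous (str, str, str) tuple."""
  namespace: str
  local_queue: str
  priority_class: str
```

Call sites then read `routing.namespace` instead of unpacking positionally, so adding a field later cannot silently shift tuple positions.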
UX:
- Add --no-shorten-jobset-name flag so the JobSet-name shortening is
opt-out for users whose workload names already fit the super-slice
admission charLimit budget. Justify the shortening with an inline
comment that points at quota_discovery.max_k8s_workload_name_len for
the budget math.
- Only print the "workload -> JobSet name" message when the name was
actually shortened.
Tests:
- New core/quota_discovery_test.py (17 tests): resolve_team /
available_teams / available_value_classes / max_k8s_workload_name_len /
suggest / fetch_quota_config (mocked).
- Append derive_k8s_workload_name tests to core/kueue_manager_test.py
(5 tests): determinism, length cap, ldap-prefix shape, distinct inputs,
short-input behavior.
- All 502 tests pass (+23 new).
- Goldens regenerated for the renamed YAML template variables.
Added 23 new tests; all 502 tests pass locally. Goldens regenerated for the renamed YAML template variables.
Update: review feedback addressed (commit 4e284ca)

Thanks for the thorough review @jamOne-! Here's the round-trip on each comment:
- Renamings (response to "What is PoC and why...")
- Cluster-side ConfigMap name
- Structure / code quality
- Justifications added (
| File | Change |
|---|---|
| core/poc_discovery.py → core/quota_discovery.py | renamed (with git mv) + dataclass + cleaned docstring |
| core/quota_discovery_test.py | new, 17 tests |
| core/local_cache.py | docstring + cache dir name |
| core/kueue_manager.py | moved derive_k8s_workload_name below CONSTS |
| core/kueue_manager_test.py | +6 tests for derive_k8s_workload_name |
| commands/workload.py | rename + dataclass call sites + opt-out flag handling |
| commands/workload_status.py | new (split from workload.py) |
| parser/workload.py | import from new module + neutral help text + new flag |
| recipes/Workload_create*.md | regenerated goldens for renamed YAML template variables |
- workload.py: drop the unused run_command_for_value import; add explicit protected-access pylint pragmas for args._quota_cfg (using args as a namespace bag is intentional); hoist the TeamRouting import to module level instead of importing it inside _resolve_quota_team.
- workload_status.py: add check=False to subprocess.run; drop the unused re import and unused regex match; the deferred .workload import (kept to avoid a circular import) gets an explicit import-outside-toplevel pragma.
- local_cache.py: add check=False to both subprocess.run calls.
- kueue_manager_test.py: hoist derive_k8s_workload_name to the grouped imports block and use rsplit(maxsplit=1) for the suffix-length checks.
- quota_discovery_test.py: replace `== []` empty-list comparisons with `not <list>` to match pylint's use-implicit-booleaness preference.

Local pylint on the touched files now scores 10.00/10. All 502 tests still pass. Goldens unchanged.
- Narrow the json.loads return to dict in local_cache.read and quota_discovery.fetch_quota_config to satisfy strict no-any-return.
- Tighten the _load_quota_cfg cache-hit type check to isinstance(cached, dict).
- Suppress 'Action has no attribute completer' on argcomplete .completer assignments in parser/workload.py (existing pattern from common.py).
Pyink split the prior single-line assignment across three lines, which moved the '# type: ignore[attr-defined]' off the actual attribute access and left mypy flagging both the original error and the now-misplaced ignore comment. Hoist the Action into a local so the assignment fits on one line.
      help='The name of the cluster to delete the job on.',
      required=True,
    )
    ).completer = _cluster_completer  # type: ignore[attr-defined]
Please do not bypass the type-checker like this.
- If it's needed, add a comment above explaining why.
- But please prefer to resolve the issue so it does not introduce unexpected errors in the future.
Fair callout: point 2 isn't fully resolved by my change. Here is where things stand:
What's done in 7d77cc5 + 383b3a3:
- Every # type: ignore[attr-defined] removed at the call sites.
- New _set_completer(action, completer) helper centralises the dynamic-attribute assignment.
- Helper docstring (expanded in 383b3a3) explicitly acknowledges that setattr is itself a bypass that mypy can't introspect, and explains why no purely-static fix is currently available: argcomplete monkey-patches argparse.Action.completer at runtime, and there is no upstream type stub that declares it.
Two options for actually resolving the bypass:
- Wait for argcomplete to ship updated stubs declaring Action.completer. Out of our hands.
- Shadow Action with a Protocol that declares completer: Any and cast at the helper boundary. This is a few extra lines but makes the "trust me, this attribute exists" contract explicit to mypy. I haven't done this because it'd diverge from the existing pattern at parser/common.py:71, which uses a different escape hatch (untyped add_argument returning Any); I wanted to check before introducing a new pattern just for these new flags.
Happy to switch to (2) if you'd prefer it here, or to tackle it as a cross-cutting cleanup (consistent across both new code and parser/common.py) in a follow-up. Let me know which you'd prefer.
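Option (2) above can be sketched as follows. This is a hedged illustration, not the code in the PR: `_CompletableAction` and `set_completer` are hypothetical names, and the Protocol merely declares the `completer` attribute that argcomplete monkey-patches onto argparse.Action at runtime.

```python
import argparse
from typing import Any, Protocol, cast


class _CompletableAction(Protocol):
  """Shape of an argparse Action after argcomplete patches it."""
  completer: Any


def set_completer(action: argparse.Action, completer: Any) -> None:
  # The cast makes the "trust me, this attribute exists" contract explicit
  # to mypy instead of a blanket type-ignore at every call site.
  cast(_CompletableAction, action).completer = completer
```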
Add argument types in those helper functions, otherwise they are hard to follow
Done in 7d77cc5 — every inner helper in workload_status.py is now fully annotated.
Note: this file is moving to its own PR (see your :1151 comment) so the annotated version lives on the workload-status-pr branch.
    # ---------------------------------------------------------------------------
    # derive_k8s_workload_name
    # ---------------------------------------------------------------------------
Please remove; we do not use this kind of comment.
    # Fill project from gcloud config if not provided.
    if not getattr(args, 'project', None):
      r = _subprocess.run(
Why not use run_command_for_value?
Makes sense! Fix applied. This whole file is removed from PR #1180 per the :1151 split, so it's no longer in the diff here. The change is on the workload-status-pr branch (lines 57-62) and will land with the follow-up PR:
was:

    subprocess.run(['gcloud', 'config', 'get', 'project'], ...)

now:

    rc, out = run_command_for_value(
        'gcloud config get project',
        task='gcloud config get project',
        quiet=True,
    )
    args.project = out.strip().splitlines()[-1] if rc == 0 and out else ''
    # Imported lazily to avoid a circular import: commands/workload.py imports
    # this module's parser, and we'd otherwise close the loop.
    from .workload import _resolve_quota_team  # pylint: disable=import-outside-toplevel
Move the quota function to another file then. Do not disable linter checks...
Done in 7d77cc5. The two quota helpers _load_quota_cfg and _resolve_quota_team (now load_quota_cfg and resolve_team_for_args) moved from commands/workload.py to core/quota_discovery.py, their natural home. Both call sites import from there at module top: no import-outside-toplevel, no pylint disable. Side benefit: args._quota_cfg → args.quota_cfg drops the matching protected-access disable too.
        return f'{s // 60}m'
      return f'{s // 3600}h{(s % 3600) // 60}m'


    def _cond(wl, ctype):
Maybe we could use
xpk/src/xpk/core/kubectl_common.py
Line 38 in 20e3da5
Done in 7d77cc5. workload_status._cond / _is_true now use kubectl_common.parse_kubernetes_status for typed condition access. Added a reason field to KubernetesCondition (with one new test in kubectl_common_test.py) so the typed access covers all four fields workload_status reads. (Also moving to the separate PR.)
    # core.kueue_manager.derive_k8s_workload_name and
    # core.quota_discovery.max_k8s_workload_name_len for the budget math.
    # The user can disable shortening with --no-shorten-jobset-name.
    if team_namespace:
Why only when team_namespace is used? Seems inconsistent to me.
Intentional. This PR is meant to be purely additive: workloads submitted without --team keep their original K8s name and don't go through any of the new code paths, so existing xpk users see zero behavior change. The shortening solves a problem specific to the team-routing path (because namespace prefix eats most of the super-slice charLimit budget), so it's gated on that path. Happy to extend it to non-team workloads in a follow-up PR if you'd prefer that — but doing so here would change naming behavior for every existing xpk user, which felt riskier than warranted for this PR.
    # For PoC teams, inject the user's full display name as XPK_WORKLOAD_NAME so
    # training scripts can reference it for GCS artifact paths regardless of the
    # short K8s name xpk derives for the JobSet.
    if getattr(args, 'team', None):
      args.env.setdefault('XPK_WORKLOAD_NAME', args.workload)
@FIoannides I need your review on that one. IMO this decision can be breaking for some users.
Bucket 1 — small, mechanical fixes:
- kueue_manager_test.py: drop the decorative banner comments.
- workload_status.py: type-annotate every inner helper.
- workload_status.py: replace the direct subprocess.run for the gcloud-project lookup with run_command_for_value, matching the rest of the file.
- workload_status.py: condition handling now goes through kubectl_common.parse_kubernetes_status (typed access for type/status/lastTransitionTime/message/reason).
- parser/workload.py: drop every '# type: ignore[attr-defined]' on argcomplete .completer assignments. A new _set_completer helper uses setattr so the assignment is explicit and the type-checker is not bypassed.

Bucket 2 — refactor to eliminate the workload_status -> workload lazy import:
- Move load_quota_cfg + resolve_team_for_args from commands/workload.py to core/quota_discovery.py (their natural home). resolve_team_for_args now returns Optional[TeamRouting]; callers handle the unset case according to their own semantics.
- workload.py and workload_status.py both import from quota_discovery at module top — no more 'pylint: disable=import-outside-toplevel'.
- Rename args._quota_cfg -> args.quota_cfg (drops the protected-access pylint disable too).

Supporting change:
- kubectl_common.KubernetesCondition gains a 'reason' field (parse_kubernetes_status populates it), letting workload_status use the shared helper end-to-end without falling back to dict access.

Verified locally: 502 tests pass, pyink/mypy/pylint clean (mypy 0 errors; pylint score unchanged from main), goldens unchanged.
…n PR

Per review feedback (jamOne-, workload.py:1151): the workload status diagnostic is unrelated to the team-routing changes that motivate this PR and should land on its own. Removing the file and the parser hookup here — the workload_status work is preserved on a stacked branch and will be opened as a separate PR once AI-Hypercomputer#1180 merges.
Per review feedback (jamOne-, workload.py:119): with the variable/file
renames done in round 1, the literal string "poc-team-config" was the
last `poc-` reference left in upstream code. Two changes:
- Default ConfigMap name is now `team-quota-config` (neutral). xpk's
fetch_quota_config queries `kueue-system/team-quota-config` by default.
- A new xpk config key, `team-configmap-name`, lets operators override the
name without touching the cluster. Existing deployments running the
legacy `poc-team-config` ConfigMap can stay on it via:
xpk config set team-configmap-name poc-team-config
Implementation follows the pattern jamOne- referenced (telemetry.py /
config.py): the override is a key in DEFAULT_KEYS, read via
`get_config().get(TEAM_CONFIGMAP_NAME_KEY)` with the default name as
fallback. The error message in load_quota_cfg references both the
resolved name and the override mechanism so users can self-serve when
they hit a missing-ConfigMap error.
Tests: +2 (default name used when no override; override is honored when
set). 504 tests pass total.
Per review feedback (jamOne-, parser/workload.py:1). Adds a new section to docs/usage/workloads.md covering the team-routing flags introduced by this PR: --team, --value-class, --declared-duration-minutes, and --no-shorten-jobset-name. Explains the auto-shortening (super-slice charLimit) and the team-configmap-name override for operators with legacy ConfigMap names.
Per round-2 follow-up review: the prior docstring described the helper
as 'avoiding a per-call-site type-ignore' but did not name what it
actually is — setattr is itself a dynamic-attribute bypass that mypy
cannot statically check. Updated the docstring to:
- acknowledge setattr is a bypass, just centralised
- explain why no purely-static fix is available without external
changes (argcomplete monkey-patches Action.completer at runtime;
no upstream stub declares it)
- list the two options for resolving (upstream stubs / Protocol+cast)
- offer to switch to Protocol+cast on request
No code change — docstring only.
jamOne-
left a comment
Gemini review below:
The reviewer agent has completed the review of the pull request. Here is the verdict:
Code Review Verdict
Status: 🔴 Changes Requested
Overall Impression: The introduction of team-based routing and a runtime configurable quota discovery mechanism is a fantastic architectural improvement, successfully decoupling xpk releases from cluster administration. The code is well-structured, well-tested, and provides robust caching for responsive autocompletion. However, there are a few edge cases related to input validation, logic constraints, and shell injection hygiene that must be addressed before merging.
Finding Summary
| ID | Severity | Category | File/Location | Finding (Concise) |
|---|---|---|---|---|
| F1 | 🔴 Critical | Security | src/xpk/core/quota_discovery.py (fetch_quota_config) | The customizable team-configmap-name is directly interpolated into a shell command, creating a shell injection vulnerability. |
| F2 | 🟡 Important | Logic | src/xpk/core/kueue_manager.py (derive_k8s_workload_name) | derive_k8s_workload_name incorrectly uses negative slicing if max_len < 5, violating the max length constraint. |
| F3 | 🟡 Important | Validation | src/xpk/core/quota_discovery.py (load_quota_cfg) | The --declared-duration-minutes flag is not validated as required when --team is set, despite documentation stating otherwise. |
Detailed Findings
F1 - Prevent shell injection in kubectl command execution
- Location: src/xpk/core/quota_discovery.py around line 68 (fetch_quota_config)
- Why it matters: The cm_name variable is retrieved from the user's config via get_config().get(TEAM_CONFIGMAP_NAME_KEY) and directly interpolated into the cmd string, which is executed via run_command_for_value. While the user controls their own config, this still permits arbitrary local code execution if a malformed or malicious ConfigMap name (e.g. test; rm -rf /) is set in the XPK config.
- How to fix: Use shlex.quote to properly escape the ConfigMap name before string interpolation.

    import shlex

    def fetch_quota_config() -> dict | None:
        # ...
        cm_name = shlex.quote(_configmap_name())
        cmd = f'kubectl get configmap -n {CONFIG_CM_NAMESPACE} {cm_name} -o json'

F2 - Fix negative slicing bug in derive_k8s_workload_name
- Location: src/xpk/core/kueue_manager.py around line 84
- Why it matters: The function computes max_prefix = max_len - 5 and then slices the string using ldap[:max_prefix]. If max_len is less than 5 (which is possible if a cluster admin misconfigures the sliceName character limits or uses an exceptionally long namespace), max_prefix becomes negative. In Python, a negative slice index counts from the end of the string, which causes the function to return a string much longer than max_len, thus silently violating the super-slice Kubernetes character limits and causing downstream admission check failures.
- How to fix: Ensure max_len is at least 5 before attempting to generate the prefix and hash, raising an explicit error if it's too small.

    def derive_k8s_workload_name(display_name: str, max_len: int) -> str:
        # ...
        if max_len < 5:
            raise ValueError(
                f"Calculated max_len ({max_len}) is too small to accommodate the"
                " required short name format. Please check the 'sliceName'"
                " character limits in the team-quota ConfigMap."
            )
        # Reserve 5 chars for '-' + 4 hex digits
        max_prefix = max_len - 5
        ldap = display_name.split("-")[0]
        prefix = ldap[:max_prefix]
        hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
        return f"{prefix}-{hex4}"

F3 - Enforce --declared-duration-minutes when --team is specified
- Location: src/xpk/core/quota_discovery.py inside load_quota_cfg
- Why it matters: The argparse help text explicitly states that --declared-duration-minutes is "Required when --team is set." However, load_quota_cfg does not enforce this. If a user omits the flag, the workload might be submitted without the duration label, preventing the time-limit controller from stopping overrunning jobs or functioning correctly.
- How to fix: Add explicit validation to ensure the duration value is provided whenever the team flag is utilized.

    def load_quota_cfg(args) -> dict | None:
        # ...
        if args.team not in (cfg.get('teams') or {}):
            # ... existing team-not-found error handling ...
            xpk_exit(1)
        if getattr(args, 'declared_duration_minutes', None) is None:
            xpk_print(
                'ERROR: --declared-duration-minutes is required when --team is set.'
            )
            xpk_exit(1)

Mechanical cleanups from jamOne-'s round-3 review:
- docs/usage/workloads.md: drop the filler sentence above the flag table (the table is self-explanatory).
- commands/workload.py: remove the empty-value comment above team_namespace init; consolidate the XPK_WORKLOAD_NAME env-var setdefault under the existing `if team_routing is not None:` branch instead of a separate getattr/isinstance check earlier in the function.
- core/local_cache.py: route current_context() and gke_contexts_from_kubeconfig() through run_command_for_value (with quiet=True so tab completion isn't corrupted by xpk_print). Removes the last raw subprocess.run sites in this file, matching the codebase's command-helper pattern.
- parser/workload.py: add explicit argument types to _cluster_completer and the inner _complete in _cached_field_completer.
Round-3 review AI-Hypercomputer#6/AI-Hypercomputer#7. Match parser/common.py:71's pattern: each add_argument(...).completer = X is now a single chain expression at the call site instead of going through a helper that wrapped setattr. Mypy stays clean by relaxing the parameter type on the two enclosing parser-setup functions (set_workload_delete_parser and set_workload_list_parser) to match this file's existing convention in add_shared_workload_create_required_arguments — untyped parser param so Any.add_argument().completer = X works for mypy. Removes the _set_completer wrapper, its docstring, and the typing.Any need at that boundary.
…uting recipe

Round-3 review AI-Hypercomputer#2. Lets reviewers (and downstream users) exercise the team-routing path without deploying the cluster-side ConfigMap, and gives the PR a concrete entry point that shows the rendered JobSet with the team's namespace, priorityClass, and Kueue/team labels. Mirrors the existing DRY_RUN_RESERVATION_SUB_BLOCKS env-var pattern in core/reservation.py (modeled on the reviewer's pointer). When set, the env var's JSON payload is treated as the contents of data["config.json"] from the live team-quota ConfigMap; otherwise xpk falls through to the live `kubectl get configmap` path (so live behavior is unchanged when the env var is unset). The recipe is golden-tested via tools/recipes.py and produces a 190-line xpk dry-run output that highlights the team-routing fields (namespace, priorityClass, queue-name, team/value-class/declared-duration-minutes labels, the workload-name shortening for the super-slice charLimit, and the XPK_WORKLOAD_NAME env injection).
Pushed three commits addressing round-3 on top of 383b3a3:
Local CI: mypy, pyink, pylint, pytest (504), and tools/recipes.py golden all green. Ready for re-review.
…#1180 review 4219697499)

Three findings from the automated Gemini review jamOne- ran on top of round-3 (review 4219697499):

F1 — shell injection in fetch_quota_config. cm_name comes from xpk config (xpk config set team-configmap-name <name>) and was interpolated into a shell command run via run_command_for_value (shell=True). A malicious or malformed config value (e.g. 'name; rm -rf /') would have been executed verbatim. Shell-quote with shlex.quote.

F2 — negative-slicing bug in derive_k8s_workload_name. With max_len < 5, max_prefix = max_len - 5 went negative, ldap[:max_prefix] removed characters from the end of the prefix instead, and the resulting {prefix}-{hex4} silently exceeded max_len. Triggered when the team-quota ConfigMap's sliceName.charLimit is set too tightly relative to a long namespace. Now raises ValueError with a clear message pointing at the ConfigMap field, so the misconfiguration is caught up-front instead of at admission time.

F3 — --declared-duration-minutes not enforced when --team is set. The flag's help text says "Required when --team is set"; load_quota_cfg didn't actually check, so a workload could be submitted with no duration label and the cluster-side time-limit controller would have nothing to gate on. Validation now matches the help text.

Tests: +4 new (shell-injection-via-quoted-cm-name, derive raises for small max_len, declared-duration-minutes required/accepted with --team). 508 pass.
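The F2 negative-slicing behavior can be reproduced in isolation. A toy sketch (a hypothetical function, not xpk's actual code) showing how a negative max_prefix trims from the end of the prefix instead of truncating it:

```python
def buggy_prefix(name: str, max_len: int) -> str:
  """Reproduces the pre-fix slicing: negative index counts from the END."""
  max_prefix = max_len - 5  # reserve '-' + 4 hex digits
  return name.split('-')[0][:max_prefix]
```

With max_len = 3, max_prefix is -2, so 'mlperf'[:-2] yields a 4-char prefix; appending the '-hex4' suffix then exceeds max_len instead of raising.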
Pushed 65c4ec4 addressing the three Gemini findings:
+4 unit tests covering each case. mypy, pyink, pylint, pytest (508), and tools/recipes.py golden all ok locally.
Two CI-only failures from the previous push:

1. linter (W0404/C0415): `test_derive_k8s_workload_name_raises_when_max_len_too_small` redundantly imported pytest inside the function. pytest is already imported at module level (line 19); use that.
2. verify-goldens: the recipe captured my local working directory (/home/sivaibhav_google_com/xpk-fork) into the dry-run output's "Adding ... to container image archive" line, which CI prints as /home/runner/work/xpk/xpk. Added --script-dir=/tmp to the recipe command to normalize the path, matching the convention in Workload_create_super-slicing.md and others.

Regenerated the golden. Verified locally with the same env CI uses (XPK_VERSION_OVERRIDE=v0.0.0 only, no TELEMETRY_TRASH_EXECUTION).
jamOne-
left a comment
Gemini review below. Please resolve F1 and F2.
Here is the review of the pull request #1180:
Code Review Verdict
Status: 🔴 Changes Requested
Overall Impression: The PR introduces a robust and flexible team-based Kueue routing mechanism via ConfigMaps and implements an elegant local caching strategy for sub-second CLI tab completions. The codebase is well-structured and follows clean abstraction principles. However, there are critical logic errors in the JobSet name derivation algorithm that will cause the Kubernetes API to reject workloads under specific constraints, as well as a shell injection vulnerability when executing commands with user-provided config values.
Finding Summary
| ID | Severity | Category | File/Location | Finding (Concise) |
|---|---|---|---|---|
| F1 | 🔴 Critical | Logic | src/xpk/core/kueue_manager.py | Negative string slicing in `derive_k8s_workload_name` produces invalid or excessively long names when `max_len < 6`. |
| F2 | 🟡 Important | Security | src/xpk/core/quota_discovery.py | Missing shell escaping (`shlex.quote`) for the ConfigMap name risks arbitrary command execution. |
| F3 | 🔵 Suggestion | Error Handling | src/xpk/core/local_cache.py | Overly broad exception suppression (`except Exception`) masks legitimate filesystem or permission errors. |
Detailed Findings
F1 - Logic Error in Workload Name Derivation
- Location: `src/xpk/core/kueue_manager.py:86` (in `derive_k8s_workload_name`)
- Why it matters: The slice-length math relies on `max_prefix = max_len - 5`. If a team has a long namespace (e.g., > 18 characters), `max_len` can drop below 5, making `max_prefix` negative.
  - Negative slicing bug: In Python, `ldap[:negative_number]` removes characters from the end of the string rather than capping the front (e.g., `"amandaliang"[:-1]` becomes `"amandalian"`). This creates a name that is longer than `max_len`, violating the strict 49-character super-slice controller limit and causing loud admission failures.
  - RFC 1123 violation: If `max_len == 5`, then `max_prefix == 0` and `ldap[:0]` evaluates to `""`, resulting in a derived name like `"-abcd"`. The Kubernetes API enforces RFC 1123, which strictly requires object names to start with an alphanumeric character, not a hyphen.
- How to fix: Ensure `max_len` is sufficiently large (at least 6) and safely handle edge cases before attempting to slice the string.
```python
import hashlib


def derive_k8s_workload_name(display_name: str, max_len: int) -> str:
  if max_len < 6:
    raise ValueError(
        f"max_len ({max_len}) is too small to accommodate a valid "
        "Kubernetes name with the deterministic hash suffix. Check the "
        "namespace length and sliceName limits."
    )
  # Reserve 5 chars for '-' + 4 hex digits
  max_prefix = max_len - 5
  ldap = display_name.split("-")[0]
  prefix = ldap[:max_prefix]
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  return f"{prefix}-{hex4}"
```

F2 - Command Injection Vulnerability via Missing Escaping
- Location: `src/xpk/core/quota_discovery.py:75` (in `fetch_quota_config`)
- Why it matters: The ConfigMap name is resolved from local configuration (`get_config().get(TEAM_CONFIGMAP_NAME_KEY)`) and immediately interpolated into a `kubectl` command executed with `shell=True` (via `run_command_for_value`). If a user configures their `team-configmap-name` with shell metacharacters (e.g., `xpk config set team-configmap-name "foo; rm -rf /"`), it will execute arbitrary commands on their machine. While xpk relies heavily on `shell=True`, configuration-driven injection is an avoidable risk.
- How to fix: Use `shlex.quote()` to safely escape the dynamically retrieved ConfigMap name.
```python
import shlex


def fetch_quota_config() -> dict | None:
  # ...
  cm_name = shlex.quote(_configmap_name())
  cmd = f'kubectl get configmap -n {CONFIG_CM_NAMESPACE} {cm_name} -o json'
  rc, out = run_command_for_value(cmd, task=cmd, quiet=True)
  # ...
```

F3 - Overly Broad Exception Suppression in Local Cache
- Location: `src/xpk/core/local_cache.py` (in `write`, `read`, `all_contexts`)
- Why it matters: The cache operations wrap their execution in a broad `except Exception: pass` (or return `None`). While making the cache "best effort" is a good architectural choice, silently catching all `Exception`s suppresses logical bugs (like `NameError`, `TypeError`) alongside legitimate `OSError`s (like full disks or permission issues).
- How to fix: Narrow the exception scope to catch only expected runtime failure conditions, making debugging significantly easier if the code breaks in the future.
```python
def read(context: str) -> dict | None:
  # ...
  try:
    p = _path_for(context)
    if not p.exists():
      return None
    payload = json.loads(p.read_text())
    return payload if isinstance(payload, dict) else None
  except (OSError, json.JSONDecodeError):
    return None
```

> `workload_create_pathways_parser.set_defaults(func=workload_create_pathways)`
> `def set_workload_delete_parser(workload_delete_parser: ArgumentParser):`
Oh, I see now why it worked for `ParserOrArgumentGroup` — it just types the object as `Any` after `add_argument`...
Sorry for the confusion. Could you please revert removing the types and go back to the original solution (add mypy-bypassing comments)? That was the cleanest approach of all that we have available.
scaliby
left a comment
This change is too big to be reviewed and carries a significant risk. Please work on splitting it into small, incremental PRs that are easy to reason about.
> `' --priority. Valid teams are discovered at runtime from'`
> `" the cluster's team-quota ConfigMap."`
> `),`
> `).completer = _cached_field_completer('teams')`
Please move field caching here and below to a separate PR and make it consistent with other arguments. Otherwise drop this:

`.completer = _cached_field_completer('teams')`
> `from .validators import directory_path_type, name_type`
> `def _cached_field_completer(field: str) -> Callable[..., list[str]]:`
Please move this to a separate PR. This is not related to team-based routing.
> `cfg = _sample_cfg()`
> `mocker.patch('xpk.core.quota_discovery.fetch_quota_config', return_value=cfg)`
> `mocker.patch('xpk.core.quota_discovery.xpk_print')`
> `mocker.patch('xpk.core.quota_discovery.xpk_exit', side_effect=SystemExit(1))`
xpk_exit by default raises SystemExit. Is this mock necessary?
> `must fail loud, not silently emit a workload missing the duration label."""`
> `cfg = _sample_cfg()`
> `mocker.patch('xpk.core.quota_discovery.fetch_quota_config', return_value=cfg)`
> `mocker.patch('xpk.core.quota_discovery.xpk_print')`
Why? I think it's fine to just print some stuff as a part of the test.
Add team-based queue routing to `xpk workload create`

Summary
This adds three flags to `xpk workload create` and one new sub-command `xpk workload status`, enabling xpk to route a workload to a team-specific Kueue ClusterQueue / LocalQueue / PriorityClass on clusters that have been set up with multi-team quota sharing:

- `--team=<name>` — routes the JobSet to a per-team namespace + LocalQueue + PriorityClass, overriding `--priority`.
- `--value-class=<name>` — adds a job-class label for downstream audit and ordering.
- `--declared-duration-minutes=<int>` — added to the pod-template metadata so a cluster-side time-limit controller can stop overrunning jobs.
- `xpk workload status --team=<t> [--workload=<name>]` — diagnoses whether a workload is QUEUED / STUCK / RUNNING / FINISHED, and on STUCK cases parses recent Warning events to surface common admission failures (with a copy-pasteable shorter-name suggestion when the super-slice 49-char limit is hit).
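To make the four-state triage concrete, here is a hedged sketch of how such a classifier might read Kueue-style workload conditions. The condition names mirror Kueue's `QuotaReserved` / `Admitted` / `Finished` condition types, but this is illustrative only, not the PR's actual implementation:

```python
def classify_workload(conditions: dict[str, bool]) -> str:
  """Map Kueue-style condition flags to the four user-facing states.

  `conditions` is a simplified {type: status} view of a Workload's
  status.conditions; a real command would inspect the full objects
  and recent Warning events for STUCK diagnoses.
  """
  if conditions.get("Finished"):
    return "FINISHED"
  if conditions.get("Admitted"):
    return "RUNNING"
  if conditions.get("QuotaReserved"):
    # Quota held but admission still failing -> parse Warning events.
    return "STUCK"
  return "QUEUED"


print(classify_workload({"QuotaReserved": True}))  # STUCK
print(classify_workload({}))  # QUEUED
```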
When `--team` is not set, behavior is bit-identical to upstream — no cluster calls, no new error paths, no new dependencies.

When `--team` is set, xpk reads a `kueue-system/poc-team-config` ConfigMap on the target cluster to discover the available teams, namespaces, value classes, and the super-slice name-length budget. This means adding, removing, or retuning a team is a cluster-side config change with no xpk release needed.
Why a ConfigMap, not a hardcoded dict
An earlier iteration of this work (visible in the commit history of the fork) shipped the team list as a Python dict in `core/kueue_manager.py`. That made every cluster-side change (adding a team, retuning a quota, fixing a typo in a PriorityClass name) require an xpk release plus every user reinstalling. The current shape moves the source of truth to the cluster: a small ConfigMap rendered by the cluster admin's tooling. xpk's view of "what teams exist" is always whatever `kubectl get cm -n kueue-system poc-team-config -o json` says right now.
User-visible UX additions
Did-you-mean
Tab completion (via the existing `argcomplete` integration)

- `--cluster` completes from `kubectl config get-contexts` (GKE prefix).
- `--team` and `--value-class` complete from a small per-cluster cache at `~/.xpk/poc-cache/<context>.json` that is refreshed every time a `--team` invocation succeeds — so completion never blocks on a cluster call.
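A best-effort read of that per-context cache might look like the sketch below. The `~/.xpk/poc-cache/<context>.json` path convention comes from this PR; the function shape is illustrative (a temp directory stands in for the home directory so the sketch is self-contained):

```python
import json
import tempfile
from pathlib import Path

# Illustrative stand-in for ~/.xpk/poc-cache/; a temp dir keeps the
# sketch runnable without touching the real home directory.
CACHE_DIR = Path(tempfile.mkdtemp()) / "poc-cache"
CACHE_DIR.mkdir(parents=True)


def cached_fields(context: str, field: str) -> list[str]:
  """Return cached completions for one kubectl context, best-effort."""
  try:
    payload = json.loads((CACHE_DIR / f"{context}.json").read_text())
    values = payload.get(field, [])
    return values if isinstance(values, list) else []
  except (OSError, json.JSONDecodeError):
    return []  # never block or crash tab completion on a bad/missing cache


# A successful --team invocation would refresh cluster A's cache:
(CACHE_DIR / "cluster-a.json").write_text(json.dumps({"teams": ["alpha", "beta"]}))

print(cached_fields("cluster-a", "teams"))  # ['alpha', 'beta']
print(cached_fields("cluster-b", "teams"))  # [] -- per-cluster isolation
```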
Workload-name decoupling
Users can now pass any-length `--workload` names. xpk derives a short, K8s-safe JobSet name (`{ldap_prefix}-{hex4}`, deterministic from `sha256(display_name)`) that fits the super-slice admission controller's 49-char ceiling, and stashes the original on the JobSet via the `xpk.google.com/workload` label. `xpk workload status / delete / list` then use the label as the lookup key, so the user never needs to know the truncated form.
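The derivation described above can be sketched as follows (this follows the `{ldap_prefix}-{hex4}` scheme stated in this PR; validation of overly small limits is omitted for brevity):

```python
import hashlib


def short_name(display_name: str, max_len: int = 49) -> str:
  """Sketch of the deterministic short-name scheme described above."""
  prefix = display_name.split("-")[0][: max_len - 5]  # reserve '-' + 4 hex
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  return f"{prefix}-{hex4}"


long_name = "sivaibhav-my-very-long-experiment-name-2024-run-17"
# Deterministic: the same display name always maps to the same JobSet
# name; lookups go through the xpk.google.com/workload label instead.
print(short_name(long_name))
```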
Commits
1. `--team` / `--value-class` / `--declared-duration-minutes` flags, plumbed into the JobSet template. Initial implementation has the team table hardcoded.
2. `workload status` PoC queue diagnostic command — the new sub-command and its parser.
3. ConfigMap discovery: replaces the hardcoded table with runtime discovery, decouples the user-facing workload name from the K8s JobSet name, and adds the local cache that powers tab completion + did-you-mean. The end state of this commit is the actual feature the user-facing surface should be reviewed against; the earlier two are kept separate for reviewability of the layered design rather than because they ship a useful intermediate state.
Backward compatibility
- `xpk workload create` without `--team` is unchanged: same default namespace, same `multislice-queue`, same priority handling, no extra cluster calls.
- `xpk workload create --help` no longer enumerates a fixed `choices=[...]` for `--team` / `--value-class` (because the set is cluster-dependent now); the help text points to runtime discovery. If maintainers prefer argparse-time validation, an alternative is to keep a non-strict default list in help text only.
Note on naming
These flags and the `poc-team-config` ConfigMap are PoC-named throughout because they were developed for a specific TPU quota PoC. The mechanism itself is general-purpose ("team-aware queue routing for Kueue-enabled clusters") and could be renamed if upstream prefers — happy to follow up with a rename once we agree on a final name. Wanted to land the implementation first to get the design discussion grounded.
Test plan
- `xpk workload create` (no `--team`): unchanged behavior, no cluster call for team discovery.
- `xpk workload create --team=<valid>`: routes correctly to `<namespace>/lq/<priorityClass>`; pod-template carries the three labels.
- `xpk workload create --team=<typo>`: errors with did-you-mean + available list, no JobSet created.
- `xpk workload create --team=<valid>` against a cluster with no `poc-team-config` ConfigMap: errors with deployment hint, no JobSet created.
- `xpk workload status --team=<t>`: lists all workloads in the team's namespace with QUEUED/STUCK/RUNNING/FINISHED diagnoses and quota percentages.
- `xpk workload status --team=<t> --workload=<long-display-name>`: resolves to the K8s JobSet via the `xpk.google.com/workload` label.
- `register-python-argcomplete xpk` then `<TAB>` on `--cluster`, `--team`, `--value-class`: returns expected completions. `~/.xpk/poc-cache/<context>.json` is per-cluster; completing on cluster A doesn't suggest teams from cluster B.
Out of scope
The `poc-team-config` ConfigMap and the cluster-side pieces that consume it (Kueue ClusterQueues, namespaces, time-limit controller, the chart that renders them) live outside this PR. Reviewers don't need to deploy them to assess the xpk side; the discovery code is a no-op for any cluster that doesn't have the ConfigMap, provided `--team` is unset.