
xpk: add team-based PoC quota routing with ConfigMap discovery #1180

Open
ultrons wants to merge 19 commits into AI-Hypercomputer:main from ultrons:pr-rebased

Conversation

@ultrons

@ultrons ultrons commented Apr 28, 2026

Add team-based queue routing to xpk workload create

Summary

This adds three flags to xpk workload create and one new sub-command
xpk workload status, enabling xpk to route a workload to a team-specific
Kueue ClusterQueue / LocalQueue / PriorityClass on clusters that have been set
up with multi-team quota sharing:

  • --team=<name> — routes the JobSet to a per-team namespace + LocalQueue +
    PriorityClass, overriding --priority.
  • --value-class=<name> — adds a job-class label for downstream audit and
    ordering.
  • --declared-duration-minutes=<int> — added to the pod-template metadata so a
    cluster-side time-limit controller can stop overrunning jobs.
  • xpk workload status --team=<t> [--workload=<name>] — diagnoses whether a
    workload is QUEUED / STUCK / RUNNING / FINISHED, and on STUCK cases parses
    recent Warning events to surface common admission failures (with a
    copy-pasteable shorter-name suggestion when the super-slice 49-char limit is
    hit).

When --team is not set, behavior is bit-identical to upstream — no
cluster calls, no new error paths, no new dependencies.

When --team is set, xpk reads a kueue-system/poc-team-config ConfigMap
on the target cluster to discover the available teams, namespaces, value
classes, and the super-slice name-length budget. This means adding,
removing, or retuning a team is a cluster-side config change with no xpk
release needed.

Why a ConfigMap, not a hardcoded dict

An earlier iteration of this work (visible in the commit history of the fork)
shipped the team list as a Python dict in core/kueue_manager.py. That made
every cluster-side change (adding a team, retuning a quota, fixing a typo in a
PriorityClass name) require an xpk release + every user reinstalling. The
current shape moves the source of truth to the cluster: a small ConfigMap
rendered by the cluster admin's tooling. xpk's view of "what teams exist" is
always whatever kubectl get cm -n kueue-system poc-team-config -o json says
right now.

User-visible UX additions

Did-you-mean

$ xpk workload create --cluster=foo --team=mlpef ...
ERROR: --team='mlpef' not found on this cluster. Did you mean: ml-perf?
       Available teams: dev, gsc, ml-perf, nightly-regression, scale-test
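The did-you-mean hint needs nothing beyond the standard library; a minimal sketch (team_error is a hypothetical name, not necessarily the helper this PR ships):

```python
import difflib

def team_error(typo: str, available: list[str]) -> str:
    """Build the --team error message with a closest-match hint."""
    # get_close_matches ranks candidates by SequenceMatcher ratio; cutoff=0.6
    # filters out suggestions that are too far from what the user typed.
    matches = difflib.get_close_matches(typo, available, n=1, cutoff=0.6)
    hint = f" Did you mean: {matches[0]}?" if matches else ""
    return (
        f"ERROR: --team='{typo}' not found on this cluster.{hint}\n"
        f"       Available teams: {', '.join(sorted(available))}"
    )
```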

Tab completion (via existing argcomplete integration)

$ xpk workload create --cluster=bodab<TAB>
  bodaborg-super-alpha-cluster  bodaborg-tpu7x-8
  bodaborg-tpu7x-auto-nap2      bodaborg-tpu7x-nap-auto

$ xpk workload create --cluster=bodaborg-super-alpha-cluster --team=<TAB>
  dev  gsc  ml-perf  nightly-regression  scale-test

--cluster completes from kubectl config get-contexts (GKE prefix).
--team and --value-class complete from a small per-cluster cache at
~/.xpk/poc-cache/<context>.json that is refreshed every time a --team
invocation succeeds — so completion never blocks on a cluster call.

Workload-name decoupling

Users can now pass a --workload name of any length. xpk derives a short K8s-safe
JobSet name ({ldap_prefix}-{hex4} deterministic from
sha256(display_name)) that fits the super-slice admission controller's
49-char ceiling, and stashes the original on the JobSet via the
xpk.google.com/workload label. xpk workload status / delete / list then
use the label as the lookup key, so the user never needs to know the
truncated form.
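Under the naming scheme described above, the derivation could look roughly like this (derive_short_name and the exact prefix-truncation policy are assumptions; only the {ldap_prefix}-{hex4}-from-sha256 shape comes from the PR):

```python
import hashlib

def derive_short_name(display_name: str, ldap: str, max_len: int = 49) -> str:
    """Deterministic short JobSet name: {ldap_prefix}-{hex4}.

    The suffix is the first 4 hex chars of sha256(display_name), so the same
    display name always maps to the same K8s name, with no cluster lookup.
    """
    digest = hashlib.sha256(display_name.encode()).hexdigest()[:4]
    return f"{ldap[: max_len - 5]}-{digest}"  # leave room for '-' + 4 hex chars
```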

Commits

  1. xpk: add team-based routing flags for PoC quota system — the three new
    flags, plumbed into the JobSet template. Initial implementation has the
    team table hardcoded.
  2. xpk: add workload status PoC queue diagnostic command — the new
    sub-command and its parser.
  3. xpk: ConfigMap-driven team discovery + local cache for completion
    replaces the hardcoded table with runtime discovery, decouples the
    user-facing workload name from the K8s JobSet name, and adds the local
    cache that powers tab completion + did-you-mean. The end state of this
    commit is the actual feature the user-facing surface should be reviewed
    against; the earlier two are kept separate for reviewability of the layered
    design rather than because they ship a useful intermediate state.

Backward compatibility

  • No new required flags anywhere.
  • xpk workload create without --team is unchanged: same default namespace,
    same multislice-queue, same priority handling, no extra cluster calls.
  • xpk workload create --help no longer enumerates a fixed choices=[...]
    for --team / --value-class (because the set is cluster-dependent now);
    the help text points to runtime discovery. If maintainers prefer
    argparse-time validation, an alternative is to keep a non-strict default
    list in help text only.

Note on naming

These flags and the poc-team-config ConfigMap are PoC-named throughout
because they were developed for a specific TPU quota PoC. The mechanism
itself is general-purpose ("team-aware queue routing for Kueue-enabled
clusters") and could be renamed if upstream prefers — happy to follow up
with a rename once we agree on a final name. Wanted to land the
implementation first to get the design discussion grounded.

Test plan

  • xpk workload create (no --team): unchanged behavior, no cluster
    call for team discovery.
  • xpk workload create --team=<valid>: routes correctly to
    <namespace> / lq / <priorityClass>; pod-template carries the three
    labels.
  • xpk workload create --team=<typo>: errors with did-you-mean +
    available list, no JobSet created.
  • xpk workload create --team=<valid> against a cluster with no
    poc-team-config ConfigMap: errors with deployment hint, no JobSet
    created.
  • xpk workload status --team=<t>: lists all workloads in the team's
    namespace with QUEUED/STUCK/RUNNING/FINISHED diagnoses and quota
    percentages.
  • xpk workload status --team=<t> --workload=<long-display-name>:
    resolves to the K8s JobSet via the xpk.google.com/workload label.
  • register-python-argcomplete xpk then <TAB> on --cluster,
    --team, --value-class: returns expected completions.
  • Multi-cluster: cache at ~/.xpk/poc-cache/<context>.json is
    per-cluster; completing on cluster A doesn't suggest teams from
    cluster B.

Out of scope

  • The cluster-side resources that produce the poc-team-config ConfigMap
    (Kueue ClusterQueues, namespaces, time-limit controller, the chart that
    renders them) live outside this PR. Reviewers don't need to deploy them
    to assess the xpk side; as long as --team is unset, the discovery code is
    a no-op even on clusters that don't have the ConfigMap.

ultrons added 3 commits April 28, 2026 17:20
Adds three flags to `xpk workload create`:
  --team                       routes the job to <namespace>/<lq>/<priorityClass>
  --value-class                job-class label for audit and priority ordering
  --declared-duration-minutes  honest p90 estimate, used by a time-limit controller

When --team is set:
  - Pod template is placed in the team's namespace (poc-<team>) and labeled with
    the team's LocalQueue and PriorityClass.
  - declared-duration-minutes is propagated into the pod template's metadata
    labels (Kueue does not copy arbitrary JobSet metadata to the Workload, but
    it does copy pod-template metadata into spec.podSets[*].template).

When --team is unset, behavior is identical to upstream (default namespace,
multislice-queue).

Team routing initially uses a hardcoded POC_TEAM_CONFIG dict in
core/kueue_manager.py; a follow-up commit replaces this with discovery from a
cluster-side ConfigMap.
Adds `xpk workload status --team=<t> --workload=<name>` (or omit --workload to
list all). The command tells the user, with one diagnostic, whether their
workload is:
  - QUEUED (and what's ahead of them)
  - STUCK (quota reserved but admission failing)
  - RUNNING
  - FINISHED

For STUCK workloads the command parses recent Warning events and surfaces the
most common causes (e.g. slice name length over the super-slice 49-char
limit), with a copy-pasteable shorter-name suggestion.

Looks up workloads by the xpk.google.com/workload label so the user can pass
their full display name regardless of any internal name shortening.

Cluster credentials are fetched at runtime; the user no longer needs
compute/zone in their gcloud config for this command to work.
Three related changes that move team configuration from hardcoded dicts to
runtime discovery, eliminating xpk releases as the gating step for cluster
admin changes:

1. Decouple xpk --workload from the K8s JobSet name. xpk now derives a short,
   deterministic K8s name (`{ldap_prefix}-{hex4}`) that fits the super-slice
   admission controller's 49-char limit, while keeping the user-facing display
   name on the JobSet via the xpk.google.com/workload label. Users pass the
   same display name to `workload status / delete / list` regardless of length.

2. Replace POC_TEAM_CONFIG and POC_TEAM_MAX_WORKLOAD_NAME dicts (and the two
   argparse choices=[...] lists) with discovery from a kueue-system/poc-team-config
   ConfigMap. New module core/poc_discovery.py: fetch_poc_config(),
   resolve_team(), max_k8s_workload_name_len(), available_teams(),
   available_value_classes(). When --team is unset there is no cluster call —
   upstream behavior is preserved bit-for-bit. When --team is set with a value
   not on the cluster, xpk now prints the live available list as the error.

3. Add an opportunistic local cache under ~/.xpk/poc-cache/<context>.json so
   neither tab completion (argcomplete completers on --cluster, --team,
   --value-class) nor did-you-mean suggestions (`Did you mean: ml-perf?`) need
   to call the cluster on every keystroke. The cache is refreshed any time a
   live ConfigMap fetch succeeds; if it is missing or stale the worst case is a
   slightly less useful error message — never a wrong-routing decision.
When a PoC team is in use, args.workload is the user-friendly display
name (e.g. mlperf-conv-13) and the actual K8s JobSet name is a derived
short name (e.g. mlperf-d6cc). The "Follow your workload here:" and
"Follow your worker 0, slice 0 logs here:" URLs were hardcoding
namespace="default" and using args.workload for the pod prefix, so
they pointed nowhere — the workload lives in the team namespace and
its pods are named after the JobSet, not args.workload.

This commit:
- Initializes poc_namespace/k8s_name early in workload_create so the
  PoC values are available where the URLs are emitted.
- Substitutes those values into both the dashboard URL and the Cloud
  Logging filter (namespace_name and pod_name prefix).
- Falls back to the upstream behavior (namespace=default, prefix=
  args.workload) when no PoC team is set, so non-PoC users are
  unaffected.
Collaborator

@jamOne- jamOne- left a comment


Thanks for the PR!

  1. Please also resolve the failing presubmits.
  2. Please add unit tests to your changes.

Comment thread src/xpk/commands/workload.py Outdated
_SUPER_SLICING_WORKLOAD_NAME_LIMIT = 28


def _load_poc_cfg(args) -> dict | None:
Collaborator


What is PoC and why do we name our code after it?

Author


Fair point — "PoC" was an internal project name and shouldn't leak into upstream code. Renamed throughout: poc_discovery.py → quota_discovery.py, _resolve_poc_team → _resolve_quota_team, _load_poc_cfg → _load_quota_cfg, _poc_cfg → _quota_cfg, _poc_labels → _team_labels, etc. All user-facing strings (parser help text, error messages, log lines) now use neutral wording like "team-quota" / "team-routing". The cluster-side ConfigMap name poc-team-config is kept as-is for now since renaming it is a chart-side change that needs separate coordination — happy to do that as a follow-up.

Collaborator


Nice, thanks!

The poc-team-config name, let's change that as well please. We can make the name configurable in the config like: xpk config set team-configmap-name poc-team-config if you do not want to update your cluster.

You can read the name like here:

    and get_config().get(SEND_TELEMETRY_KEY) != "false"

And prepare the config key like here:

    DEFAULT_KEYS = [
        CFG_BUCKET_KEY,
        CLUSTER_NAME_KEY,
        PROJECT_KEY,
        CLIENT_ID_KEY,
        SEND_TELEMETRY_KEY,
        ZONE_KEY,
        CUSTOM_BINARIES_PATH_KEY,
    ]

Author


Done in 3a9ac68. Default is now team-quota-config (no poc- literal anywhere). Operators with the legacy ConfigMap can override without touching the cluster: xpk config set team-configmap-name poc-team-config. Wired exactly per your references — added TEAM_CONFIGMAP_NAME_KEY to core/config.py's DEFAULT_KEYS, and quota_discovery._configmap_name() reads it via
get_config().get(...) with the default as fallback. +2 tests covering both the default and the override path.
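The described fallback amounts to something like the following (the key constant and default mirror the comment above; the real xpk get_config() accessor is stood in for by a plain dict so the sketch is self-contained):

```python
# Hypothetical constants mirroring the DEFAULT_KEYS pattern quoted in the
# review comment; not verbatim xpk code.
TEAM_CONFIGMAP_NAME_KEY = "team-configmap-name"
DEFAULT_TEAM_CONFIGMAP = "team-quota-config"

def configmap_name(config: dict) -> str:
    """Resolve the team ConfigMap name, falling back to the new default."""
    return config.get(TEAM_CONFIGMAP_NAME_KEY) or DEFAULT_TEAM_CONFIGMAP
```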

Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/commands/workload.py Outdated
Comment on lines +570 to +574
# For PoC teams, inject the user's full display name as XPK_WORKLOAD_NAME so
# training scripts can reference it for GCS artifact paths regardless of the
# short K8s name xpk derives for the JobSet.
if getattr(args, 'team', None):
args.env.setdefault('XPK_WORKLOAD_NAME', args.workload)
Collaborator


This looks like a hack. Do we need this name shortening?

Author


The shortening exists because the super-slice admission controller has a 49-char hard limit on the slice name it creates from each JobSet (format: `<namespace>-jobset-<name>-<kueue-hash>-slice-job-<replica-index>`). Subtracting the fixed framing (~26 chars) and the namespace length leaves a small budget for the user-facing workload name — e.g. for poc-ml-perf (11 chars), only 12 chars are available. User-friendly names like mlperf-conv-13 (14 chars) blow past this, the JobSet/Slice fails admission, and the user sees a confusing low-level error. Without shortening, this is a real foot-gun.
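The budget reduces to simple arithmetic; a sketch with the framing overhead treated as an assumed constant (the real math lives in quota_discovery.max_k8s_workload_name_len):

```python
SLICE_NAME_LIMIT = 49   # super-slice admission controller's hard limit
FIXED_FRAMING = 26      # assumed overhead of the '-jobset-...-slice-job-...' framing

def max_workload_name_len(namespace: str) -> int:
    """Characters left for the user-facing workload name (budget sketch)."""
    return SLICE_NAME_LIMIT - FIXED_FRAMING - len(namespace)
```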

Updated this commit to:

  • Add an inline comment explaining the constraint and pointing at quota_discovery.max_k8s_workload_name_len for the budget math.
  • Add --no-shorten-jobset-name so users can opt out when their workload name is short enough already.
  • Only print the "workload → JobSet name" mapping when shortening actually triggered (no-op cases stay silent).

Collaborator


@FIoannides I need your review on that one. IMO this decision can be breaking for some users.

Comment thread src/xpk/commands/workload.py Outdated
)

poc_namespace, poc_local_queue, poc_priority = _resolve_poc_team(args)
# Derive a short K8s-safe JobSet name for PoC teams; use display name as-is otherwise.
Collaborator


But why? This can be unexpected from a user perspective. We also have other commands that take workload as an argument, e.g. xpk inspector.

So is it actually required for the --team changes?

Author


Yes — same root cause as the previous comment. With --team, the workload routes to a per-team namespace, which both forces the JobSet to be created in that namespace AND eats into the slice-name char budget. Both behaviors only kick in when --team is set; the upstream non-team path is unchanged. Added a docstring + inline comment explaining the conditional, plus the --no-shorten-jobset-name opt-out flag for users who'd rather see the create fail loudly than have their name silently shortened.

Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/core/kueue_manager.py
@jamOne- jamOne- self-assigned this Apr 29, 2026
@jamOne- jamOne- added the release-features features label Apr 29, 2026
Renamings:
- core/poc_discovery.py -> core/quota_discovery.py
- _resolve_poc_team -> _resolve_quota_team, _load_poc_cfg -> _load_quota_cfg,
  fetch_poc_config -> fetch_quota_config
- args._poc_cfg -> args._quota_cfg
- _build_poc_labels -> _build_team_labels (and pod-template variant)
- {poc_labels} / {poc_pod_template_labels} YAML keys -> {team_labels} /
  {team_pod_template_labels}
- Local cache dir: ~/.xpk/poc-cache -> ~/.xpk/quota-cache
- All "PoC" references in user-facing strings, docstrings and parser help
  text replaced with neutral team-quota wording.

Structure:
- Replace ambiguous (str, str, str) tuple return of resolve_team() with a
  frozen @dataclass TeamRouting{namespace, local_queue, priority_class}.
- Move workload_status() and its helpers from commands/workload.py to a
  dedicated commands/workload_status.py.
- In core/kueue_manager.py, move derive_k8s_workload_name() below the
  module-level constants block so the constants stay together at the top.
- Tightened "team is set" check to a real non-empty-string test (still
  passes existing MagicMock-based tests).

UX:
- Add --no-shorten-jobset-name flag so the JobSet-name shortening is
  opt-out for users whose workload names already fit the super-slice
  admission char-limit budget. Justify the shortening with an inline
  comment that points at quota_discovery.max_k8s_workload_name_len for
  the budget math.
- Only print the "workload -> JobSet name" message when the name was
  actually shortened.

Tests:
- New core/quota_discovery_test.py (17 tests): resolve_team /
  available_teams / available_value_classes / max_k8s_workload_name_len /
  suggest / fetch_quota_config (mocked).
- Append derive_k8s_workload_name tests to core/kueue_manager_test.py
  (5 tests): determinism, length cap, ldap-prefix shape, distinct inputs,
  short-input behavior.
- All 502 tests pass (+23 new).
- Goldens regenerated for the renamed YAML template variables.
@ultrons
Author

ultrons commented Apr 29, 2026

Thanks for the PR!

  1. Please also resolve the failing presubmits.
  2. Please add unit tests to your changes.

Added 23 new tests:

  • core/quota_discovery_test.py (new, 17 tests): resolve_team for known/unknown/empty-map cases, available_teams / available_value_classes happy + missing-key paths, max_k8s_workload_name_len with cfg / defaults / partial overrides, suggest did-you-mean ranking, and fetch_quota_config with kubectl output mocked (success / failure / malformed JSON / cache-write side effect).
  • core/kueue_manager_test.py (extended, 6 tests): derive_k8s_workload_name determinism, max_len cap, ldap-prefix shape, distinct inputs producing distinct hashes, short-input still-hashed.

All 502 tests pass locally. Goldens regenerated for the renamed YAML template variables.

@ultrons
Author

ultrons commented Apr 29, 2026

Update — review feedback addressed (commit 4e284ca)

Thanks for the thorough review @jamOne-! Here's the round-trip on each comment.

Renamings (response to "What is PoC and why...")

PoC was an internal project name and shouldn't have leaked into upstream code. Fully de-PoC-ified throughout:

| Before | After |
| --- | --- |
| `core/poc_discovery.py` | `core/quota_discovery.py` |
| `_resolve_poc_team` | `_resolve_quota_team` |
| `_load_poc_cfg` / `_poc_cfg` | `_load_quota_cfg` / `_quota_cfg` |
| `fetch_poc_config` | `fetch_quota_config` |
| `_build_poc_labels` (+ pod-template variant) | `_build_team_labels` (+ pod-template variant) |
| YAML template `{poc_labels}` / `{poc_pod_template_labels}` | `{team_labels}` / `{team_pod_template_labels}` |
| `~/.xpk/poc-cache/` | `~/.xpk/quota-cache/` |
| All "PoC" references in parser help text, docstrings, log messages | neutral team-quota wording |

Cluster-side ConfigMap name poc-team-config is kept as-is for now (renaming it is a chart-side change that needs separate coordination — happy to do that as a follow-up PR).

Structure / code quality

  • Tuple → dataclass (your :159 comment): resolve_team() now returns a @dataclass(frozen=True) TeamRouting{namespace, local_queue, priority_class}. Callers use attribute access.
  • workload_status → its own file (:1204 comment): moved to commands/workload_status.py. The parser imports from there. workload.py is back to non-status concerns. The decorative banner nits at :1149 and :1202 are gone with it.
  • derive_k8s_workload_name placement (kueue_manager :52): moved below the module-level constants block.

Justifications added (:574, :852)

The K8s name shortening exists because the super-slice admission controller has a 49-char hard limit on slice names of the form <namespace>-jobset-<name>-<kueue-hash>-slice-job-<replica-index>. After the framing overhead and namespace, only ~12 chars are left for the user-facing workload name on a typical team — so mlperf-conv-13 (14 chars) silently fails admission with an opaque K8s error. The shortening is a sharp-edge fix.

To make the behavior less surprising:

  • Inline comment now points at quota_discovery.max_k8s_workload_name_len for the budget math.
  • New --no-shorten-jobset-name opt-out flag.
  • The "workload → JobSet name" log line only prints when shortening actually triggered (no-op cases stay silent).

workload status vs inspector (:1151)

They serve different audiences:

  • xpk inspector = SRE debug dump. Runs ~15 commands collecting cluster-wide state (configmaps, all node pools, kueue/jobset controller deployments + logs, etc.). The --workload arg is largely a label hint; the dump is cluster-wide regardless.
  • xpk workload status = focused user-facing diagnosis. Answers "why is my job stuck?" in a 3-line plain-English answer (RUNNING / QUEUED / STUCK / FINISHED, plus team-quota numbers and a specific fix when an AdmissionCheck error or queue position is detected).

If you'd like I can rename to xpk workload diagnose to make the focused-diagnosis intent obvious, or fold the logic into xpk inspector --workload=X --diagnose-only to keep one surface. Let me know which direction you prefer.

Tests (response to global "write tests for your changes")

+23 new tests, all 502 pass locally + goldens regenerated:

  • core/quota_discovery_test.py (new, 17 tests): resolve_team for known/unknown/empty-map cases, available_teams / available_value_classes happy + missing-key paths, max_k8s_workload_name_len with cfg / defaults / partial overrides, suggest did-you-mean ranking, and fetch_quota_config with kubectl output mocked (success / failure / malformed JSON / cache-write side effect).
  • core/kueue_manager_test.py (extended, 6 tests): derive_k8s_workload_name determinism, max_len cap, ldap-prefix shape, distinct inputs producing distinct hashes, short-input still-hashed.

Files touched in 4e284ca

| File | Change |
| --- | --- |
| `core/poc_discovery.py` → `core/quota_discovery.py` | renamed (with git mv) + dataclass + cleaned docstring |
| `core/quota_discovery_test.py` | new, 17 tests |
| `core/local_cache.py` | docstring + cache dir name |
| `core/kueue_manager.py` | moved derive_k8s_workload_name below CONSTS |
| `core/kueue_manager_test.py` | +6 tests for derive_k8s_workload_name |
| `commands/workload.py` | rename + dataclass call sites + opt-out flag handling |
| `commands/workload_status.py` | new (split from workload.py) |
| `parser/workload.py` | import from new module + neutral help text + new flag |
| `recipes/Workload_create*.md` | regenerated goldens for renamed YAML template variables |

ultrons added a commit to ultrons/xpk that referenced this pull request Apr 29, 2026
- workload.py: drop unused run_command_for_value import; add explicit
  protected-access pylint pragmas for args._quota_cfg (using args as a
  namespace bag is intentional); hoist the TeamRouting import to module
  level instead of importing it inside _resolve_quota_team.
- workload_status.py: add check=False to subprocess.run; drop unused
  re import + unused regex match; the deferred .workload import (kept
  to avoid circular import) gets an explicit import-outside-toplevel
  pragma.
- local_cache.py: add check=False to both subprocess.run calls.
- kueue_manager_test.py: hoist derive_k8s_workload_name to the grouped
  imports block and use rsplit(maxsplit=1) for the suffix-length checks.
- quota_discovery_test.py: replace `== []` empty-list comparisons with
  `not <list>` to match pylint's use-implicit-booleaness preference.

Local pylint on the touched files now scores 10.00/10. All 502 tests
still pass. Goldens unchanged.
ultrons added 3 commits April 29, 2026 23:18
- Narrow json.loads return to dict in local_cache.read and
  quota_discovery.fetch_quota_config to satisfy strict no-any-return.
- Tighten _load_quota_cfg cache-hit type check to isinstance(cached, dict).
- Suppress 'Action has no attribute completer' on argcomplete .completer
  assignments in parser/workload.py (existing pattern from common.py).
Pyink split the prior single-line assignment across three lines, which
moved the '# type: ignore[attr-defined]' off the actual attribute access
and left mypy flagging both the original error and the now-misplaced
ignore comment. Hoist the Action into a local so the assignment fits on
one line.
Comment thread src/xpk/parser/workload.py Outdated
help='The name of the cluster to delete the job on.',
required=True,
)
).completer = _cluster_completer # type: ignore[attr-defined]
Collaborator


Please do not by-pass the type-checker like this.

  1. If it's needed, add a comment above explaining why.
  2. But please prefer to resolve the issue so it does not introduce unexpected errors in the future.

Author


Fair callout. Point 2 isn't fully resolved by my change; here is where things stand:

What's done in 7d77cc5 + 383b3a3:

  • Every # type: ignore[attr-defined] removed at the call sites.
  • New _set_completer(action, completer) helper centralises the dynamic-attribute assignment.
  • Helper docstring (expanded in 383b3a3) explicitly acknowledges that setattr is itself a bypass that mypy can't introspect, and explains why no purely-static fix is currently available: argcomplete monkey-patches argparse.Action.completer at runtime, and there is no upstream type stub that declares it.

Two options for actually resolving the bypass:

  1. Wait for argcomplete to ship updated stubs declaring Action.completer. Out of our hands.
  2. Shadow Action with a Protocol that declares completer: Any and cast at the helper boundary. This is a few extra lines but makes the "trust me, this attribute exists" contract explicit to mypy. I haven't done this because it'd diverge from the existing pattern at parser/common.py:71, which uses a different escape hatch (untyped add_argument returning Any), wanted to check before introducing a new pattern just for these new flags.

Happy to switch to (2) if you'd prefer it here, or to tackle it as a cross-cutting cleanup (consistent across both new code and parser/common.py) in a follow-up. Let me know which.
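For reference, option (2) could be as small as this sketch (CompletableAction and set_completer are proposed names, not existing xpk code):

```python
import argparse
from typing import Any, Protocol, cast

class CompletableAction(Protocol):
    """argcomplete monkey-patches a `completer` attribute onto argparse.Action
    at runtime; no upstream type stub declares it, so this Protocol states the
    contract once instead of a per-site `# type: ignore`."""
    completer: Any

def set_completer(action: argparse.Action, completer: Any) -> None:
    # The cast makes the "this attribute exists at runtime" claim visible
    # to mypy at a single, documented boundary.
    cast(CompletableAction, action).completer = completer
```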

Comment thread src/xpk/commands/workload_status.py Outdated
Collaborator


Add argument types in those helper functions, otherwise they are hard to follow

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7d77cc5 — every inner helper in workload_status.py is now fully annotated.

Note: this file is moving to its own PR (see your :1151 comment) so the annotated version lives on the workload-status-pr branch.

Comment thread src/xpk/core/kueue_manager_test.py Outdated
Comment on lines +721 to +723
# ---------------------------------------------------------------------------
# derive_k8s_workload_name
# ---------------------------------------------------------------------------
Collaborator


Please remove; we do not use this kind of comment.

Author


Removed in 7d77cc5.

Comment thread src/xpk/parser/workload.py
Comment thread src/xpk/commands/workload_status.py Outdated

# Fill project from gcloud config if not provided.
if not getattr(args, 'project', None):
r = _subprocess.run(
Collaborator


Why not use run_command_for_value?

Author

@ultrons ultrons Apr 30, 2026


Makes sense! Fix applied. This whole file is removed from PR #1180 per the :1151 split, so it's no longer in the diff here. The change is on the workload-status-pr branch (lines 57-62) and will land with the follow-up PR:

    # was: subprocess.run(['gcloud', 'config', 'get', 'project'], ...)
    rc, out = run_command_for_value(
        'gcloud config get project',
        task='gcloud config get project',
        quiet=True,
    )
    args.project = out.strip().splitlines()[-1] if rc == 0 and out else ''

Comment thread src/xpk/commands/workload_status.py Outdated

# Imported lazily to avoid a circular import: commands/workload.py imports
# this module's parser, and we'd otherwise close the loop.
from .workload import _resolve_quota_team # pylint: disable=import-outside-toplevel
Collaborator


Move the quota function to another file then. Do not disable linter checks...

Author

Done in 7d77cc5. The two quota helpers _load_quota_cfg and _resolve_quota_team (now load_quota_cfg and resolve_team_for_args) moved from commands/workload.py to core/quota_discovery.py — their natural home. Both call sites import from there at module top: no import-outside-toplevel, no pylint disable. Side benefit: args._quota_cfg → args.quota_cfg drops the matching protected-access disable too.

Comment thread src/xpk/commands/workload_status.py Outdated
return f'{s // 60}m'
return f'{s // 3600}h{(s % 3600) // 60}m'

def _cond(wl, ctype):
Collaborator

Maybe we could use

def parse_kubernetes_status(
?

Author

Done in 7d77cc5. workload_status._cond / _is_true now use kubectl_common.parse_kubernetes_status for typed condition access. Added a reason field to KubernetesCondition (with one new test in kubectl_common_test.py) so the typed access covers all four fields workload_status reads. (Also moving to the separate PR.)

Comment thread src/xpk/commands/workload.py Outdated
# core.kueue_manager.derive_k8s_workload_name and
# core.quota_discovery.max_k8s_workload_name_len for the budget math.
# The user can disable shortening with --no-shorten-jobset-name.
if team_namespace:
Collaborator

Why only when team_namespace is used? Seems inconsistent to me.

Author

Intentional. This PR is meant to be purely additive: workloads submitted without --team keep their original K8s name and don't go through any of the new code paths, so existing xpk users see zero behavior change. The shortening solves a problem specific to the team-routing path (because the namespace prefix eats most of the super-slice charLimit budget), so it's gated on that path. Happy to extend it to non-team workloads in a follow-up PR if you'd prefer that — but doing so here would change naming behavior for every existing xpk user, which felt riskier than warranted for this PR.
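For concreteness, the budget math can be sketched as follows (the constants and names here are illustrative, not the PR's actual code; the real derivation lives in derive_k8s_workload_name):

```python
import hashlib

def short_name_budget(namespace: str, char_limit: int = 49) -> int:
  # The super-slice controller's charLimit covers the namespace plus a
  # separator plus the JobSet name, so a long team namespace leaves
  # little room for the name itself.
  return char_limit - len(namespace) - 1

def shorten(display_name: str, max_len: int) -> str:
  # Readable prefix + '-' + 4 hex digits of a deterministic hash,
  # mirroring the scheme this PR gates on the team-routing path.
  prefix = display_name[: max_len - 5]
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  return f"{prefix}-{hex4}"

budget = short_name_budget("team-research-alpha")  # 49 - 19 - 1 = 29
name = shorten("my-very-long-experiment-name-2026-04-30", budget)
assert len(name) <= budget
```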

Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/commands/workload.py Outdated
Comment on lines +570 to +574
# For PoC teams, inject the user's full display name as XPK_WORKLOAD_NAME so
# training scripts can reference it for GCS artifact paths regardless of the
# short K8s name xpk derives for the JobSet.
if getattr(args, 'team', None):
  args.env.setdefault('XPK_WORKLOAD_NAME', args.workload)
Collaborator

@FIoannides I need your review on that one. IMO this decision can be breaking for some users.

ultrons added 5 commits April 30, 2026 19:37
Bucket 1 — small, mechanical fixes:
- kueue_manager_test.py: drop the decorative banner comments.
- workload_status.py: type-annotate every inner helper.
- workload_status.py: replace direct subprocess.run for the gcloud-project
  lookup with run_command_for_value, matching the rest of the file.
- workload_status.py: condition handling now goes through
  kubectl_common.parse_kubernetes_status (typed access for
  type/status/lastTransitionTime/message/reason).
- parser/workload.py: drop every '# type: ignore[attr-defined]' on
  argcomplete .completer assignments. New _set_completer helper uses
  setattr so the assignment is explicit and the type-checker is not
  bypassed.

Bucket 2 — refactor to eliminate the workload_status -> workload lazy
import:
- Move load_quota_cfg + resolve_team_for_args from commands/workload.py
  to core/quota_discovery.py (their natural home). resolve_team_for_args
  now returns Optional[TeamRouting]; callers handle the unset case
  according to their own semantics.
- workload.py and workload_status.py both import from quota_discovery at
  module top — no more 'pylint: disable=import-outside-toplevel'.
- Rename args._quota_cfg -> args.quota_cfg (drops the protected-access
  pylint disable too).

Supporting change:
- kubectl_common.KubernetesCondition gains a 'reason' field
  (parse_kubernetes_status populates it). Lets workload_status use the
  shared helper end-to-end without falling back to dict access.

Verified locally: 502 tests pass, pyink/mypy/pylint clean (mypy 0
errors; pylint score unchanged from main), goldens unchanged.
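As a rough sketch of the supporting change (shapes follow the commit message; the actual kubectl_common types may differ):

```python
from dataclasses import dataclass

@dataclass
class KubernetesCondition:
  # Typed view of one .status.conditions entry; 'reason' is the field
  # this change adds so workload_status avoids raw dict access.
  type: str
  status: str
  last_transition_time: str = ""
  message: str = ""
  reason: str = ""

def parse_kubernetes_status(status: dict) -> list[KubernetesCondition]:
  return [
      KubernetesCondition(
          type=c.get("type", ""),
          status=c.get("status", ""),
          last_transition_time=c.get("lastTransitionTime", ""),
          message=c.get("message", ""),
          reason=c.get("reason", ""),
      )
      for c in status.get("conditions", [])
  ]

conds = parse_kubernetes_status(
    {"conditions": [{"type": "Admitted", "status": "True", "reason": "Admitted"}]}
)
assert conds[0].reason == "Admitted"
```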
…n PR

Per review feedback (jamOne-, workload.py:1151): the workload status
diagnostic is unrelated to the team-routing changes that motivate this
PR and should land on its own. Removing the file and the parser hookup
here — the workload_status work is preserved on a stacked branch and
will be opened as a separate PR once AI-Hypercomputer#1180 merges.
Per review feedback (jamOne-, workload.py:119): with the variable/file
renames done in round 1, the literal string "poc-team-config" was the
last `poc-` reference left in upstream code. Two changes:

- Default ConfigMap name is now `team-quota-config` (neutral). xpk's
  fetch_quota_config queries `kueue-system/team-quota-config` by default.

- A new xpk config key, `team-configmap-name`, lets operators override the
  name without touching the cluster. Existing deployments running the
  legacy `poc-team-config` ConfigMap can stay on it via:

      xpk config set team-configmap-name poc-team-config

Implementation follows the pattern jamOne- referenced (telemetry.py /
config.py): the override is a key in DEFAULT_KEYS, read via
`get_config().get(TEAM_CONFIGMAP_NAME_KEY)` with the default name as
fallback. The error message in load_quota_cfg references both the
resolved name and the override mechanism so users can self-serve when
they hit a missing-ConfigMap error.

Tests: +2 (default name used when no override; override is honored when
set). 504 tests pass total.
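The fallback logic boils down to something like this (a sketch; the real config.py accessors may differ):

```python
DEFAULT_TEAM_CONFIGMAP_NAME = "team-quota-config"
TEAM_CONFIGMAP_NAME_KEY = "team-configmap-name"

def configmap_name(xpk_config: dict) -> str:
  # Operators on the legacy name can opt back in via
  # `xpk config set team-configmap-name poc-team-config`; everyone
  # else gets the neutral default with no action required.
  return xpk_config.get(TEAM_CONFIGMAP_NAME_KEY) or DEFAULT_TEAM_CONFIGMAP_NAME

assert configmap_name({}) == "team-quota-config"
assert configmap_name({TEAM_CONFIGMAP_NAME_KEY: "poc-team-config"}) == "poc-team-config"
```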
Per review feedback (jamOne-, parser/workload.py:1). Adds a new section
to docs/usage/workloads.md covering the team-routing flags introduced
by this PR: --team, --value-class, --declared-duration-minutes, and
--no-shorten-jobset-name. Explains the auto-shortening (super-slice
charLimit) and the team-configmap-name override for operators with
legacy ConfigMap names.
Per round-2 follow-up review: the prior docstring described the helper
as 'avoiding a per-call-site type-ignore' but did not name what it
actually is — setattr is itself a dynamic-attribute bypass that mypy
cannot statically check. Updated the docstring to:

  - acknowledge setattr is a bypass, just centralised
  - explain why no purely-static fix is available without external
    changes (argcomplete monkey-patches Action.completer at runtime;
    no upstream stub declares it)
  - list the two options for resolving (upstream stubs / Protocol+cast)
  - offer to switch to Protocol+cast on request

No code change — docstring only.
@ultrons
Author

ultrons commented Apr 30, 2026

Thanks for the round-2 review @jamOne-! Pushed dea20f7, 3a9ac68, e4e19a3, 383b3a3 (on top of 7d77cc5 from the previous round). Per-comment responses inline. All four CI checks green.

Collaborator

@jamOne- jamOne- left a comment

Gemini review below:

The reviewer agent has completed the review of the pull request. Here is the verdict:

Code Review Verdict

Status: 🔴 Changes Requested

Overall Impression: The introduction of team-based routing and a runtime configurable quota discovery mechanism is a fantastic architectural improvement, successfully decoupling xpk releases from cluster administration. The code is well-structured, well-tested, and provides robust caching for responsive autocompletion. However, there are a few edge cases related to input validation, logic constraints, and shell injection hygiene that must be addressed before merging.

Finding Summary

ID | Severity | Category | File/Location | Finding (Concise)
F1 | 🔴 Critical | Security | src/xpk/core/quota_discovery.py (fetch_quota_config) | The customizable team-configmap-name is directly interpolated into a shell command, creating a shell injection vulnerability.
F2 | 🟡 Important | Logic | src/xpk/core/kueue_manager.py (derive_k8s_workload_name) | derive_k8s_workload_name incorrectly uses negative slicing if max_len < 5, violating the max length constraint.
F3 | 🟡 Important | Validation | src/xpk/core/quota_discovery.py (load_quota_cfg) | The --declared-duration-minutes flag is not validated as required when --team is set, despite documentation stating otherwise.

Detailed Findings

F1 - Prevent shell injection in kubectl command execution

  • Location: src/xpk/core/quota_discovery.py around line 68 (fetch_quota_config)
  • Why it matters: The cm_name variable is retrieved from the user's config via get_config().get(TEAM_CONFIGMAP_NAME_KEY) and directly interpolated into the cmd string, which is executed via run_command_for_value. While the user controls their own config, this still permits arbitrary local code execution if a malformed or malicious ConfigMap name (e.g. test; rm -rf /) is set in the XPK config.
  • How to fix: Use shlex.quote to properly escape the ConfigMap name before string interpolation.
import shlex

def fetch_quota_config() -> dict | None:
  # ...
  cm_name = shlex.quote(_configmap_name())
  cmd = f'kubectl get configmap -n {CONFIG_CM_NAMESPACE} {cm_name} -o json'
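For reference, shlex.quote wraps any value containing shell metacharacters in single quotes, so the interpolated ConfigMap name reaches kubectl as a single literal argument:

```python
import shlex

# A malicious config value like this would otherwise terminate the kubectl
# command and run `rm -rf /` as a second shell command.
malicious = "test; rm -rf /"
quoted = shlex.quote(malicious)
assert quoted == "'test; rm -rf /'"  # one single-quoted shell word

# Benign names (alphanumerics, '-', '_', '.') pass through unchanged.
assert shlex.quote("team-quota-config") == "team-quota-config"
```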

F2 - Fix negative slicing bug in derive_k8s_workload_name

  • Location: src/xpk/core/kueue_manager.py around line 84
  • Why it matters: The function computes max_prefix = max_len - 5 and then slices the string using ldap[:max_prefix]. If max_len is less than 5 (which is possible if a cluster admin misconfigures the sliceName character limits or uses an exceptionally long namespace), max_prefix becomes negative. In Python, a negative slice index counts from the end of the string, which causes the function to return a string much longer than max_len, thus silently violating the super-slice Kubernetes character limits and causing downstream admission check failures.
  • How to fix: Ensure max_len is at least 5 before attempting to generate the prefix and hash, raising an explicit error if it's too small.
def derive_k8s_workload_name(display_name: str, max_len: int) -> str:
  # ...
  if max_len < 5:
    raise ValueError(
        f"Calculated max_len ({max_len}) is too small to accommodate the"
        " required short name format. Please check the 'sliceName'"
        " character limits in the team-quota ConfigMap."
    )
  # Reserve 5 chars for '-' + 4 hex digits
  max_prefix = max_len - 5
  ldap = display_name.split("-")[0]
  prefix = ldap[:max_prefix]
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  return f"{prefix}-{hex4}"

F3 - Enforce --declared-duration-minutes when --team is specified

  • Location: src/xpk/core/quota_discovery.py inside load_quota_cfg
  • Why it matters: The argparse help text explicitly states that --declared-duration-minutes is "Required when --team is set." However, load_quota_cfg does not enforce this. If a user omits the flag, the workload might be submitted without the duration label, preventing the time-limit controller from stopping overrunning jobs or functioning correctly.
  • How to fix: Add explicit validation to ensure the duration value is provided whenever the team flag is utilized.
def load_quota_cfg(args) -> dict | None:
  # ...
  if args.team not in (cfg.get('teams') or {}):
    # ... existing team not found error handling ...
    xpk_exit(1)

  if getattr(args, 'declared_duration_minutes', None) is None:
    xpk_print(
        'ERROR: --declared-duration-minutes is required when --team is set.'
    )
    xpk_exit(1)

Comment thread docs/usage/workloads.md Outdated
Comment thread recipes/Workload_create.md
Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/commands/workload.py Outdated
Comment thread src/xpk/core/local_cache.py Outdated
Comment thread src/xpk/parser/workload.py Outdated
Comment thread src/xpk/parser/workload.py Outdated
Comment thread src/xpk/parser/workload.py Outdated
ultrons added 3 commits May 4, 2026 17:29
Mechanical cleanups from jamOne-'s round-3 review:

- docs/usage/workloads.md: drop the filler sentence above the flag table
  (the table is self-explanatory).
- commands/workload.py: remove the empty-value comment above team_namespace
  init; consolidate the XPK_WORKLOAD_NAME env-var setdefault under the
  existing `if team_routing is not None:` branch instead of a separate
  getattr/isinstance check earlier in the function.
- core/local_cache.py: route current_context() and gke_contexts_from_kubeconfig()
  through run_command_for_value (with quiet=True so tab completion isn't
  corrupted by xpk_print). Removes the last raw subprocess.run sites in
  this file, matching the codebase's command-helper pattern.
- parser/workload.py: add explicit argument types to _cluster_completer
  and the inner _complete in _cached_field_completer.
Round-3 review AI-Hypercomputer#6/AI-Hypercomputer#7. Match parser/common.py:71's pattern: each
add_argument(...).completer = X is now a single chain expression at
the call site instead of going through a helper that wrapped setattr.

Mypy stays clean by relaxing the parameter type on the two enclosing
parser-setup functions (set_workload_delete_parser and
set_workload_list_parser) to match this file's existing convention
in add_shared_workload_create_required_arguments — untyped parser
param so Any.add_argument().completer = X works for mypy.

Removes the _set_completer wrapper, its docstring, and the typing.Any
need at that boundary.
…uting recipe

Round-3 review AI-Hypercomputer#2. Lets reviewers (and downstream users) exercise the
team-routing path without deploying the cluster-side ConfigMap, and
gives the PR a concrete entry point that shows the rendered JobSet
with the team's namespace, priorityClass, and Kueue/team labels.

Mirrors the existing DRY_RUN_RESERVATION_SUB_BLOCKS env-var pattern
in core/reservation.py (modeled on reviewer's pointer). When set, the
env var's JSON payload is treated as the contents of
data["config.json"] from the live team-quota ConfigMap; otherwise
xpk falls through to the live `kubectl get configmap` path (so live
behavior is unchanged when the env var is unset).

The recipe is golden-tested via tools/recipes.py and produces a 190-line
xpk dry-run output that highlights the team-routing fields (namespace,
priorityClass, queue-name, team/value-class/declared-duration-minutes
labels, the workload-name shortening for super-slice charLimit, and
the XPK_WORKLOAD_NAME env injection).
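The env-var override pattern can be sketched as follows (the env var name and helper here are illustrative, not necessarily what the commit uses):

```python
import json
import os

def load_team_quota_config(fetch_live):
  # If the dry-run env var is set, treat its JSON payload as the
  # ConfigMap's data["config.json"]; otherwise fall through to the
  # live `kubectl get configmap` path, so live behavior is unchanged
  # when the env var is unset.
  raw = os.environ.get("XPK_TEAM_QUOTA_CONFIG_JSON")
  if raw is not None:
    return json.loads(raw)
  return fetch_live()

# Dry-run path: the env var supplies the team config directly.
os.environ["XPK_TEAM_QUOTA_CONFIG_JSON"] = '{"teams": {"alpha": {}}}'
assert load_team_quota_config(lambda: None) == {"teams": {"alpha": {}}}

# Live path: with the env var unset, the kubectl fetcher is used.
del os.environ["XPK_TEAM_QUOTA_CONFIG_JSON"]
assert load_team_quota_config(lambda: {"teams": {}}) == {"teams": {}}
```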
@ultrons
Author

ultrons commented May 4, 2026

Pushed three commits addressing round-3 on top of 383b3a3:

Local CI: mypy, pyink, pylint, pytest (504), and tools/recipes.py golden all green. Ready for re-review.

…#1180 review 4219697499)

Three findings from the automated Gemini review jamOne- ran on top of
round-3 (review 4219697499):

F1 — shell injection in fetch_quota_config. cm_name comes from xpk
config (xpk config set team-configmap-name <name>) and was interpolated
into a shell command run via run_command_for_value (shell=True). A
malicious or malformed config value (e.g. 'name; rm -rf /') would have
been executed verbatim. Shell-quote with shlex.quote.

F2 — negative-slicing bug in derive_k8s_workload_name. With max_len < 5,
max_prefix = max_len - 5 went negative, ldap[:max_prefix] removed
characters from the end of the prefix instead, and the resulting
{prefix}-{hex4} silently exceeded max_len. Triggers when the team-quota
ConfigMap's sliceName.charLimit is set too tightly relative to a long
namespace. Now raises ValueError with a clear message pointing at the
ConfigMap field, so the misconfiguration is caught up-front instead of
at admission time.

F3 — --declared-duration-minutes not enforced when --team is set. The
flag's help text says "Required when --team is set"; load_quota_cfg
didn't actually check, so a workload could be submitted with no
duration label and the cluster-side time-limit controller would have
nothing to gate on. Validation now matches the help text.

Tests: +4 new (shell-injection-via-quoted-cm-name, derive raises for
small max_len, declared-duration-minutes required/accepted with --team).
508 pass.
@ultrons
Author

ultrons commented May 4, 2026

Pushed 65c4ec4 addressing the three Gemini findings:

  • F1 (shell injection): shlex.quote on cm_name before kubectl interpolation in fetch_quota_config.
  • F2 (negative slicing): derive_k8s_workload_name now raises ValueError when max_len < 5, surfacing the ConfigMap misconfiguration up-front instead of silently emitting a too-long name.
  • F3 (required-flag enforcement): load_quota_cfg now validates --declared-duration-minutes when --team is set, matching the help text.

+4 unit tests covering each case. mypy, pyink, pylint, pytest (508), and tools/recipes.py golden all ok locally.

Two CI-only failures from the previous push:

1. linter (W0404/C0415): test_derive_k8s_workload_name_raises_when_max_len_too_small
   redundantly imported pytest inside the function. pytest is already at
   module level (line 19); use that.

2. verify-goldens: the recipe captured my local working directory
   (/home/sivaibhav_google_com/xpk-fork) into the dry-run output's
   "Adding ... to container image archive" line, which CI prints as
   /home/runner/work/xpk/xpk. Add --script-dir=/tmp to the recipe
   command to normalize the path, matching the convention in
   Workload_create_super-slicing.md and others. Regenerated the golden.

Verified locally with the same env CI uses (XPK_VERSION_OVERRIDE=v0.0.0
only, no TELEMETRY_TRASH_EXECUTION).
Collaborator

@jamOne- jamOne- left a comment

Gemini review below. Please resolve F1 and F2.

Here is the review of the pull request #1180:

Code Review Verdict

Status: 🔴 Changes Requested

Overall Impression: The PR introduces a robust and flexible team-based Kueue routing mechanism via ConfigMaps and implements an elegant local caching strategy for sub-second CLI tab completions. The codebase is well-structured and follows clean abstraction principles. However, there are critical logic errors in the JobSet name derivation algorithm that will cause the Kubernetes API to reject workloads under specific constraints, as well as a shell injection vulnerability when executing commands with user-provided config values.

Finding Summary

ID | Severity | Category | File/Location | Finding (Concise)
F1 | 🔴 Critical | Logic | src/xpk/core/kueue_manager.py | Negative string slicing in derive_k8s_workload_name produces invalid or excessively long names when max_len < 6.
F2 | 🟡 Important | Security | src/xpk/core/quota_discovery.py | Missing shell escaping (shlex.quote) for the ConfigMap name risks arbitrary command execution.
F3 | 🔵 Suggestion | Error Handling | src/xpk/core/local_cache.py | Overly broad exception suppression (except Exception) masks legitimate filesystem or permission errors.

Detailed Findings

F1 - Logic Error in Workload Name Derivation

  • Location: src/xpk/core/kueue_manager.py:86 (in derive_k8s_workload_name)
  • Why it matters: The slice length math relies on max_prefix = max_len - 5. If a team has a long namespace (e.g., > 18 characters), max_len can drop below 5, making max_prefix negative.
    1. Negative Slicing Bug: In Python, ldap[:negative_number] removes characters from the end of the string rather than capping the front (e.g., "amandaliang"[:-1] becomes "amandalian"). This creates a name that is longer than max_len, violating the strict 49-character super-slice controller limit and causing loud admission failures.
    2. RFC 1123 Violation: If max_len == 5, max_prefix == 0. ldap[:0] evaluates to "", resulting in a derived name like "-abcd". Kubernetes API enforces RFC 1123, which strictly requires object names to start with an alphanumeric character, not a hyphen.
  • How to fix: Ensure max_len is sufficiently large (at least 6) and safely handle edge cases before attempting to slice the string.
def derive_k8s_workload_name(display_name: str, max_len: int) -> str:
  if max_len < 6:
    raise ValueError(
        f"max_len ({max_len}) is too small to accommodate a valid"
        " Kubernetes name with the deterministic hash suffix. Check the"
        " namespace length and sliceName limits."
    )
  # Reserve 5 chars for '-' + 4 hex digits
  max_prefix = max_len - 5
  ldap = display_name.split("-")[0]
  prefix = ldap[:max_prefix]
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  return f"{prefix}-{hex4}"
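The negative-slice behavior this finding describes is easy to reproduce in isolation:

```python
# max_len = 4 gives max_prefix = -1: a negative slice bound trims from
# the *end* of the string instead of capping the front.
ldap = "amandaliang"
assert ldap[:-1] == "amandalian"        # 10 chars survive, not 0
assert len(ldap[:-1] + "-abcd") == 15   # far over a max_len of 4

# max_len = 5 gives max_prefix = 0: the derived name starts with '-',
# which RFC 1123 forbids for Kubernetes object names.
assert ldap[:0] + "-abcd" == "-abcd"
```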

F2 - Command Injection Vulnerability via Missing Escaping

  • Location: src/xpk/core/quota_discovery.py:75 (in fetch_quota_config)
  • Why it matters: The ConfigMap name is resolved from local configuration (get_config().get(TEAM_CONFIGMAP_NAME_KEY)) and immediately interpolated into a kubectl command executed with shell=True (via run_command_for_value). If a user configures their team-configmap-name with shell metacharacters (e.g., xpk config set team-configmap-name "foo; rm -rf /"), it will execute arbitrary commands on their machine. While xpk relies heavily on shell=True, configuration-driven injection is an avoidable risk.
  • How to fix: Use shlex.quote() to safely escape the dynamically retrieved ConfigMap name.
import shlex

def fetch_quota_config() -> dict | None:
  # ...
  cm_name = shlex.quote(_configmap_name())
  cmd = f'kubectl get configmap -n {CONFIG_CM_NAMESPACE} {cm_name} -o json'
  rc, out = run_command_for_value(cmd, task=cmd, quiet=True)
  # ...

F3 - Overly Broad Exception Suppression in Local Cache

  • Location: src/xpk/core/local_cache.py (in write, read, all_contexts)
  • Why it matters: The cache operations wrap their execution in a broad except Exception: pass (or returning None). While making the cache "best effort" is a good architectural choice, silently catching all Exceptions suppresses logical bugs (like NameError, TypeError) alongside legitimate OSErrors (like full disks or permission issues).
  • How to fix: Narrow the exception scope to catch only expected runtime failure conditions, making debugging significantly easier if the code breaks in the future.
def read(context: str) -> dict | None:
  # ...
  try:
    p = _path_for(context)
    if not p.exists():
      return None
    payload = json.loads(p.read_text())
    return payload if isinstance(payload, dict) else None
  except (OSError, json.JSONDecodeError):
    return None

workload_create_pathways_parser.set_defaults(func=workload_create_pathways)


def set_workload_delete_parser(workload_delete_parser: ArgumentParser):
Collaborator

Oh I see it now why it worked for ParserOrArgumentGroup, it just types the object to Any after add_argument...

Sorry for this confusion, could you please revert removing the types and go back to the original solution (add mypy bypassing comments)? That was the cleanest approach of all that we have available.

Comment thread src/xpk/parser/workload.py Outdated
Member

@scaliby scaliby left a comment

This change is too big to be reviewed and carries a significant risk. Please work on splitting it into small, incremental PRs that are easy to reason about.

' --priority. Valid teams are discovered at runtime from'
" the cluster's team-quota ConfigMap."
),
).completer = _cached_field_completer('teams')
Member

Please move field caching here and below to a separate PR and make it consistent with other arguments. Otherwise drop this.

.completer = _cached_field_completer('teams')

from .validators import directory_path_type, name_type


def _cached_field_completer(field: str) -> Callable[..., list[str]]:
Member

please move this to a separate PR. This is not related to team-based routing

cfg = _sample_cfg()
mocker.patch('xpk.core.quota_discovery.fetch_quota_config', return_value=cfg)
mocker.patch('xpk.core.quota_discovery.xpk_print')
mocker.patch('xpk.core.quota_discovery.xpk_exit', side_effect=SystemExit(1))
Member

xpk_exit by default raises SystemExit. Is this mock necessary?

must fail loud, not silently emit a workload missing the duration label."""
cfg = _sample_cfg()
mocker.patch('xpk.core.quota_discovery.fetch_quota_config', return_value=cfg)
mocker.patch('xpk.core.quota_discovery.xpk_print')
Member

Why? I think it's fine to just print some stuff as a part of the test.
