feat(sagemaker-ai): add HyperPod debugger skills #158

Open
badmin-aws wants to merge 2 commits into awslabs:main from badmin-aws:feat/hyperpod-debugger-skills

Conversation

@badmin-aws

@badmin-aws badmin-aws commented May 5, 2026

Summary

  • Add six new diagnostic skills for SageMaker HyperPod: cluster-debugger, mfu-debugger, nccl, node-debugger, performance-debugger, slurm-debugger
  • Fix minor issues in existing hyperpod-issue-report and hyperpod-version-checker skills
  • Bump sagemaker-ai plugin version to 1.1.1

New Skills

  • hyperpod-cluster-debugger: Cluster-wide lifecycle issues (EKS/Slurm) — creation failures, EFA, capacity, EKS access, AMI upgrades
  • hyperpod-mfu-debugger: MFU degradation triage — 7 root-cause paths (regression, errors, straggler, thermal, periodic, network, config)
  • hyperpod-nccl: NCCL failure diagnosis — timeouts, EFA/libfabric errors, collective-op aborts
  • hyperpod-node-debugger: Per-node health — GPU/ECC, disk, OOM, lifecycle, SSM, container runtime
  • hyperpod-performance-debugger: Narrow triage for uneven NCCL, filesystem throughput, GPU failure
  • hyperpod-slurm-debugger: Slurm node-state management — down/drain recovery, controller restart

Test plan

  • mise run build passes (lint:md, fmt:check, lint:manifests, lint:cross-refs, validate:refs, security)
  • Manual validation with a live HyperPod cluster
  • Verify skill auto-trigger on representative user prompts

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@badmin-aws badmin-aws requested review from a team as code owners May 5, 2026 04:17
@badmin-aws badmin-aws added the do-not-merge Do not merge the pull request label May 5, 2026
@badmin-aws badmin-aws self-assigned this May 5, 2026

@github-advanced-security github-advanced-security AI left a comment


Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@badmin-aws badmin-aws force-pushed the feat/hyperpod-debugger-skills branch from c7905f0 to 9b1c98e on May 5, 2026 04:59
…ll issues

Add six new diagnostic skills for SageMaker HyperPod:
- hyperpod-cluster-debugger: cluster-wide lifecycle issues (EKS/Slurm)
- hyperpod-mfu-debugger: MFU degradation triage
- hyperpod-nccl: NCCL failure diagnosis
- hyperpod-node-debugger: per-node health issues
- hyperpod-performance-debugger: NCCL bandwidth, filesystem, GPU failures
- hyperpod-slurm-debugger: Slurm node-state management

Also fixes:
- hyperpod-issue-report: simplify SSM troubleshooting message
- hyperpod-version-checker: remove unused RED color var, replace IS_GPU
  flag with runtime nvidia-smi detection

Bumps sagemaker-ai plugin version to 1.1.1.
@badmin-aws badmin-aws force-pushed the feat/hyperpod-debugger-skills branch from 9b1c98e to ef84ee3 on May 5, 2026 05:06
IS_GPU=false
[[ "$INSTANCE_TYPE" =~ (^|\.)(trn|inf) ]] && IS_NEURON=true
[[ "$INSTANCE_TYPE" =~ (^|\.)(p[0-9]|g[0-9]) ]] && IS_GPU=true
# GPU detection is driven by `cmd_exists nvidia-smi` at each GPU section below —


Remove the 3-line comment explaining the deleted IS_GPU flag; the code it refers to no longer exists.
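For context, the runtime-detection pattern the fix adopts can be sketched as follows. This is a hypothetical illustration, not the script verbatim; `cmd_exists` is an assumed helper name.

```shell
# Hypothetical sketch: instead of a static IS_GPU flag derived from the
# instance type, each GPU section is gated on a runtime nvidia-smi check.
cmd_exists() { command -v "$1" >/dev/null 2>&1; }

if cmd_exists nvidia-smi; then
  echo "collecting GPU diagnostics via nvidia-smi"
else
  echo "nvidia-smi not found; skipping GPU checks"
fi
```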

done

CLUSTER=""
REGION="${AWS_DEFAULT_REGION:-us-east-1}"


diagnose-cluster.sh and slurm-diagnose-fix.sh silently default to us-east-1, while nccl-diagnose.sh exits when the region is unset. Align all three on the explicit-exit behavior to avoid running diagnostics against the wrong region.
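A minimal sketch of the explicit-exit behavior, keeping the `AWS_DEFAULT_REGION` name from the snippet above (the `require_region` helper name is hypothetical):

```shell
# Hypothetical require_region helper: fail loudly instead of silently
# defaulting to us-east-1 when no region is configured.
require_region() {
  local region="${AWS_DEFAULT_REGION:-}"
  if [[ -z "$region" ]]; then
    echo "ERROR: no region configured; export AWS_DEFAULT_REGION or pass --region" >&2
    return 1
  fi
  echo "$region"
}
```

A script would then start with `REGION=$(require_region) || exit 1` rather than a hard-coded fallback.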

set -euo pipefail

# Prerequisite tools. jq is a hard requirement per skill-creator safety rules.
for cmd in aws jq python3; do


Each script re-implements the "check required tools exist" loop. Minor, but a shared `require_tools aws jq python3` helper in a future common.sh would be cleaner.
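The proposed shared helper could look like this (a sketch; `require_tools` and its placement in common.sh are the comment's suggestion, not existing code):

```shell
# Hypothetical require_tools helper for a future common.sh: report every
# missing prerequisite at once instead of failing on the first.
require_tools() {
  local missing=()
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
  done
  if ((${#missing[@]})); then
    echo "ERROR: missing required tools: ${missing[*]}" >&2
    return 1
  fi
}
```

Each script would then replace its loop with `require_tools aws jq python3 || exit 1`.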

(hyperpod-slurm-debugger), or uneven-NCCL, FSx, or GPU-failure
triage (hyperpod-performance-debugger).
metadata:
version: "1.0.0"


All six new skills ship at "1.0.0". Fine for the first cut, but worth deciding upfront whether skill-level versions track the plugin version or evolve independently, so future bumps have a rule to follow.

local slurm_node="$1"
local instance_id group ssm_target
instance_id=$(echo "$NODES_JSON" | jq -r --arg dns "$slurm_node" '
.ClusterNodeSummaries[] | select(.PrivateDnsName|startswith($dns)) | .InstanceId' | head -1)


Mapping Slurm node name to instance via startswith on PrivateDnsName can match multiple nodes if names share a prefix (e.g. ip-10-0-1-1 vs ip-10-0-1-10). Consider exact match on the hostname component.
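The suggested exact match can be illustrated like this; NODES_JSON mirrors the ClusterNodeSummaries shape from the snippet, with fabricated sample data:

```shell
# Fabricated sample data in the ClusterNodeSummaries shape.
NODES_JSON='{"ClusterNodeSummaries":[
  {"PrivateDnsName":"ip-10-0-1-10.ec2.internal","InstanceId":"i-bbb"},
  {"PrivateDnsName":"ip-10-0-1-1.ec2.internal","InstanceId":"i-aaa"}]}'
slurm_node="ip-10-0-1-1"

# Compare the full hostname component instead of using startswith,
# so ip-10-0-1-1 can no longer match ip-10-0-1-10 (selects i-aaa only).
instance_id=$(echo "$NODES_JSON" | jq -r --arg dns "$slurm_node" '
  .ClusterNodeSummaries[]
  | select((.PrivateDnsName | split(".")[0]) == $dns)
  | .InstanceId')
```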

[[ -z "$d" ]] && continue
IFS='|' read -r sn _ tgt sstatus disk_pct <<< "$d"
[[ "$sn" == "$node" ]] || continue
disk_num=$(echo "$disk_pct" | tr -d '%')


Assumes df -h always returns a percentage like 95%. On unusual filesystems or errors, disk_num could be empty or non-numeric; the [[ -n ... -ge 95 ]] guard helps, but a clearer "unable to determine disk usage" branch would be better.
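A sketch of the explicit branch, wrapping the parse from the snippet in a hypothetical `check_disk` function (the 95% threshold and message wording follow the review comment):

```shell
# Hypothetical check_disk: validate the parsed percentage before comparing,
# and surface an explicit warning when it cannot be determined.
check_disk() {
  local disk_pct="$1" disk_num
  disk_num=$(echo "$disk_pct" | tr -d '%')
  if [[ ! "$disk_num" =~ ^[0-9]+$ ]]; then
    echo "WARN: unable to determine disk usage (got: '$disk_pct')"
  elif (( disk_num >= 95 )); then
    echo "FAIL: disk ${disk_num}% full"
  else
    echo "OK: disk ${disk_num}% used"
  fi
}
```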


@ssuday ssuday left a comment


Left comments

@awsawsawsawsawsawsaws

We're conducting an internal review. Please do NOT merge this.

@krokoko
Contributor

krokoko commented May 8, 2026

Automated review output:

Scope: +17,143 / -6 lines — 6 new diagnostic skills, 8 shell scripts, minor fixes to 2 existing skills


Critical Issues (3 found)

┌─────┬──────────────────────────┬───────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ # │ Source │ File │ Issue │
├─────┼──────────────────────────┼───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │ code-reviewer, │ │ Duplicated IAM policy check block — the entire "Managed policies" validation block (ListAttachedRolePolicies + │
│ 1 │ comment-analyzer │ scripts/diagnose-cluster.sh │ AmazonSageMakerClusterInstanceRolePolicy + AmazonSSMManagedInstanceCore checks) is copy-pasted verbatim, │
│ │ │ │ producing duplicate output and redundant API calls │
├─────┼──────────────────────────┼───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 2 │ silent-failure-hunter │ scripts/perf-snapshot.sh │ Missing -e flag — uses set -uo pipefail instead of set -euo pipefail. All 7 other scripts include -e. Without │
│ │ │ │ it, failed commands don't abort execution, allowing cascading silent failures │
├─────┼──────────────────────────┼───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 3 │ silent-failure-hunter │ scripts/slurm-diagnose-fix.sh │ ssm_run() has no timeout and swallows all errors — unlike other scripts that use timeout 180, this function can │
│ │ │ │ hang indefinitely. Combined with 2>/dev/null and ` │
└─────┴──────────────────────────┴───────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


Important Issues (7 found)

┌─────┬───────────────────────┬─────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ # │ Source │ Location │ Issue │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 4 │ comment-analyzer │ hyperpod-cluster-debugger/SKILL.md │ Error Handling table references "SSM send-command throttled" but the script uses start-session (same │
│ │ │ │ issue in hyperpod-node-debugger/SKILL.md) │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 5 │ comment-analyzer │ hyperpod-performance-debugger/SKILL.md │ Claims --node-ids takes "space-separated instance IDs, not a JSON array" — contradicts all other skills │
│ │ │ │ which use JSON array format '[""]' │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 6 │ silent-failure-hunter │ scripts/slurm-diagnose-fix.sh │ No pagination on list-cluster-nodes — default page size is 10 nodes. On clusters with >10 nodes, │
│ │ │ │ controller node may not be found and --fix actions target wrong nodes │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 7 │ silent-failure-hunter │ scripts/check-efa-sg.sh │ Missing input validation on --cluster — unlike all other scripts that regex-validate cluster name/ARN, │
│ │ │ │ this one passes raw input to AWS API │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 8 │ silent-failure-hunter │ scripts/check-vpc-config.sh │ EC2 API calls fall back to empty JSON ('{"Subnets":[]}') without notification — permission denial │
│ │ │ │ manifests as false-positive [FAIL] for DNS support │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 9 │ comment-analyzer │ hyperpod-cluster-debugger/SKILL.md │ Claims "paginated via --no-paginate / NextToken up to 5000 nodes" but script actually uses --max-results │
│ │ │ Defaults │ 100 with 200 pages (20,000 nodes) │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │ │ │ Example shows --filters "Name=instance-type,Values=ml.p5.48xlarge" — EC2 API doesn't recognize the ml. │
│ 10 │ comment-analyzer │ capacity-planning.md │ prefix (should be p5.48xlarge). The script correctly strips this prefix, but the docs show the wrong │
│ │ │ │ value │
└─────┴───────────────────────┴─────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
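For the missing-pagination finding against list-cluster-nodes (item 6), the fix can be sketched as a NextToken loop. Flag names follow the `aws sagemaker list-cluster-nodes` CLI, but treat the exact output paths as an assumption to verify against the API reference; `list_all_nodes` is a hypothetical helper.

```shell
# Hypothetical list_all_nodes: page through all cluster nodes instead of
# relying on the default (small) page size, emitting one JSON object per node.
list_all_nodes() {
  local cluster="$1" region="$2" token="" page
  while :; do
    if [[ -n "$token" ]]; then
      page=$(aws sagemaker list-cluster-nodes --cluster-name "$cluster" \
        --region "$region" --max-results 100 --next-token "$token")
    else
      page=$(aws sagemaker list-cluster-nodes --cluster-name "$cluster" \
        --region "$region" --max-results 100)
    fi
    echo "$page" | jq -c '.ClusterNodeSummaries[]'
    token=$(echo "$page" | jq -r '.NextToken // empty')
    [[ -z "$token" ]] && break
  done
}
```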


Suggestions (5 found)

┌─────┬───────────────────────┬────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ # │ Source │ Location │ Issue │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 11  │ silent-failure-hunter │ All scripts                    │ python3 -c "..." 2>/dev/null                                                                                       │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 12  │ silent-failure-hunter │ All pagination loops           │                                                                                                                    │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 13 │ comment-analyzer │ hyperpod-mfu-debugger/SKILL.md │ References scripts/get-cluster-info.sh as if local, but these live in the hyperpod-ssm skill │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 14 │ silent-failure-hunter │ check-vpc-config.sh │ DNS check should handle "unknown" explicitly with a warn, not produce a [FAIL] │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 15 │ code-reviewer │ diagnose-cluster.sh │ fetch_cluster_events_cd passes accumulated JSON via sys.argv — could hit ARG_MAX on pathological clusters (unlike │
│ │ │ │ fetch_all_cluster_nodes_cd which uses stdin) │
└─────┴───────────────────────┴────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


Strengths

  • Consistent architecture — All 6 skills faithfully follow the "scripts are read-only signal collectors, references hold remediation" separation of concerns
  • Excellent skill descriptions — keyword-rich with positive triggers and negative boundaries ("Do NOT use for X → use Y"), creating a coherent delegation network
  • Strong input validation — regex validation on cluster names, regions, instance IDs across most scripts
  • SSM retry logic — diagnose-cluster.sh, nccl-diagnose.sh, and triage-cluster.sh have exemplary retry logic that distinguishes transient vs. fatal errors
  • Project structure compliance — follows plugins/<plugin>/skills/<skill>/SKILL.md with references/ and scripts/ perfectly
  • Read-only guarantee upheld — verified that diagnostic scripts only call describe-, list-, get-* APIs; slurm-diagnose-fix.sh properly gates mutations behind --fix
  • Version bump is defensible — 1.1.0 → 1.1.1 for additive diagnostic skills (though 1.2.0 would also be appropriate)

Recommended Action

  1. Fix critical issues first — remove the duplicated block in diagnose-cluster.sh, add -e to perf-snapshot.sh, add timeout + error checking to slurm-diagnose-fix.sh's ssm_run()
  2. Address documentation inaccuracies — fix send-command → start-session in Error Handling tables, correct the pagination claim in Defaults, fix the ml. prefix in capacity-planning.md
  3. Add pagination to slurm-diagnose-fix.sh and input validation to check-efa-sg.sh for parity with other scripts
  4. Consider the medium-severity suggestions for improved user experience on large clusters
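The ssm_run() hardening in step 1 could look like the sketch below: bound the call with timeout and surface failures instead of discarding stderr. The function shape and arguments mirror the review's description, not the script verbatim, and the SSM invocation details are an assumption to check against the actual script.

```shell
# Hypothetical hardened ssm_run: 180s timeout (matching the other scripts)
# and an explicit error path instead of 2>/dev/null.
ssm_run() {
  local target="$1" cmd="$2" out
  if ! out=$(timeout 180 aws ssm start-session --target "$target" \
        --document-name AWS-StartNonInteractiveCommand \
        --parameters "command=[\"$cmd\"]" 2>&1); then
    echo "ERROR: ssm_run failed or timed out for $target: $out" >&2
    return 1
  fi
  printf '%s\n' "$out"
}
```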
