feat(sagemaker-ai): add HyperPod debugger skills #158

Open
badmin-aws wants to merge 2 commits into awslabs:main from badmin-aws:feat/hyperpod-debugger-skills

Conversation

@badmin-aws

@badmin-aws badmin-aws commented May 5, 2026

Summary

  • Add six new diagnostic skills for SageMaker HyperPod: cluster-debugger, mfu-debugger, nccl, node-debugger, performance-debugger, slurm-debugger
  • Fix minor issues in existing hyperpod-issue-report and hyperpod-version-checker skills
  • Bump sagemaker-ai plugin version to 1.1.1

New Skills

  • hyperpod-cluster-debugger: Cluster-wide lifecycle issues (EKS/Slurm) — creation failures, EFA, capacity, EKS access, AMI upgrades
  • hyperpod-mfu-debugger: MFU degradation triage — 7 root-cause paths (regression, errors, straggler, thermal, periodic, network, config)
  • hyperpod-nccl: NCCL failure diagnosis — timeouts, EFA/libfabric errors, collective-op aborts
  • hyperpod-node-debugger: Per-node health — GPU/ECC, disk, OOM, lifecycle, SSM, container runtime
  • hyperpod-performance-debugger: Narrow triage for uneven NCCL, filesystem throughput, GPU failure
  • hyperpod-slurm-debugger: Slurm node-state management — down/drain recovery, controller restart

Test plan

  • mise run build passes (lint:md, fmt:check, lint:manifests, lint:cross-refs, validate:refs, security)
  • Manual validation with a live HyperPod cluster
  • Verify skill auto-trigger on representative user prompts

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@badmin-aws badmin-aws requested review from a team as code owners May 5, 2026 04:17
@badmin-aws badmin-aws added the do-not-merge Do not merge the pull request label May 5, 2026
@badmin-aws badmin-aws self-assigned this May 5, 2026

@github-advanced-security github-advanced-security AI left a comment


Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@badmin-aws badmin-aws force-pushed the feat/hyperpod-debugger-skills branch from c7905f0 to 9b1c98e on May 5, 2026 04:59
…ll issues

Add six new diagnostic skills for SageMaker HyperPod:
- hyperpod-cluster-debugger: cluster-wide lifecycle issues (EKS/Slurm)
- hyperpod-mfu-debugger: MFU degradation triage
- hyperpod-nccl: NCCL failure diagnosis
- hyperpod-node-debugger: per-node health issues
- hyperpod-performance-debugger: NCCL bandwidth, filesystem, GPU failures
- hyperpod-slurm-debugger: Slurm node-state management

Also fixes:
- hyperpod-issue-report: simplify SSM troubleshooting message
- hyperpod-version-checker: remove unused RED color var, replace IS_GPU
  flag with runtime nvidia-smi detection

Bumps sagemaker-ai plugin version to 1.1.1.
@badmin-aws badmin-aws force-pushed the feat/hyperpod-debugger-skills branch from 9b1c98e to ef84ee3 on May 5, 2026 05:06
IS_GPU=false
[[ "$INSTANCE_TYPE" =~ (^|\.)(trn|inf) ]] && IS_NEURON=true
[[ "$INSTANCE_TYPE" =~ (^|\.)(p[0-9]|g[0-9]) ]] && IS_GPU=true
# GPU detection is driven by `cmd_exists nvidia-smi` at each GPU section below —


Remove the 3-line comment explaining the deleted IS_GPU flag; the code it refers to no longer exists.
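For context, the runtime-detection pattern the fix adopts can be sketched as follows. This is a hypothetical illustration, not the script verbatim; `cmd_exists` is an assumed helper name.

```shell
# Hypothetical sketch: instead of a static IS_GPU flag derived from the
# instance type, each GPU section is gated on a runtime nvidia-smi check.
cmd_exists() { command -v "$1" >/dev/null 2>&1; }

if cmd_exists nvidia-smi; then
  echo "collecting GPU diagnostics via nvidia-smi"
else
  echo "nvidia-smi not found; skipping GPU checks"
fi
```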

done

CLUSTER=""
REGION="${AWS_DEFAULT_REGION:-us-east-1}"


diagnose-cluster.sh and slurm-diagnose-fix.sh silently default to us-east-1, while nccl-diagnose.sh exits when the region is unset. Align all three on the explicit-exit behavior to avoid running diagnostics against the wrong region.
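A minimal sketch of the explicit-exit behavior, keeping the `AWS_DEFAULT_REGION` name from the snippet above (the `require_region` helper name is hypothetical):

```shell
# Hypothetical require_region helper: fail loudly instead of silently
# defaulting to us-east-1 when no region is configured.
require_region() {
  local region="${AWS_DEFAULT_REGION:-}"
  if [[ -z "$region" ]]; then
    echo "ERROR: no region configured; export AWS_DEFAULT_REGION or pass --region" >&2
    return 1
  fi
  echo "$region"
}
```

A script would then start with `REGION=$(require_region) || exit 1` rather than a hard-coded fallback.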

set -euo pipefail

# Prerequisite tools. jq is a hard requirement per skill-creator safety rules.
for cmd in aws jq python3; do


Each script re-implements the "check required tools exist" loop. Minor, but a shared `require_tools aws jq python3` helper in a future common.sh would be cleaner.
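The proposed shared helper could look like this (a sketch; `require_tools` and its placement in common.sh are the comment's suggestion, not existing code):

```shell
# Hypothetical require_tools helper for a future common.sh: report every
# missing prerequisite at once instead of failing on the first.
require_tools() {
  local missing=()
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
  done
  if ((${#missing[@]})); then
    echo "ERROR: missing required tools: ${missing[*]}" >&2
    return 1
  fi
}
```

Each script would then replace its loop with `require_tools aws jq python3 || exit 1`.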

(hyperpod-slurm-debugger), or uneven-NCCL, FSx, or GPU-failure
triage (hyperpod-performance-debugger).
metadata:
version: "1.0.0"


All six new skills ship at "1.0.0". Fine for the first cut, but worth deciding upfront whether skill-level versions track the plugin version or evolve independently, so future bumps have a rule to follow.

local slurm_node="$1"
local instance_id group ssm_target
instance_id=$(echo "$NODES_JSON" | jq -r --arg dns "$slurm_node" '
.ClusterNodeSummaries[] | select(.PrivateDnsName|startswith($dns)) | .InstanceId' | head -1)


Mapping Slurm node name to instance via startswith on PrivateDnsName can match multiple nodes if names share a prefix (e.g. ip-10-0-1-1 vs ip-10-0-1-10). Consider exact match on the hostname component.
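The suggested exact match can be illustrated like this; NODES_JSON mirrors the ClusterNodeSummaries shape from the snippet, with fabricated sample data:

```shell
# Fabricated sample data in the ClusterNodeSummaries shape.
NODES_JSON='{"ClusterNodeSummaries":[
  {"PrivateDnsName":"ip-10-0-1-10.ec2.internal","InstanceId":"i-bbb"},
  {"PrivateDnsName":"ip-10-0-1-1.ec2.internal","InstanceId":"i-aaa"}]}'
slurm_node="ip-10-0-1-1"

# Compare the full hostname component instead of using startswith,
# so ip-10-0-1-1 can no longer match ip-10-0-1-10 (selects i-aaa only).
instance_id=$(echo "$NODES_JSON" | jq -r --arg dns "$slurm_node" '
  .ClusterNodeSummaries[]
  | select((.PrivateDnsName | split(".")[0]) == $dns)
  | .InstanceId')
```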

[[ -z "$d" ]] && continue
IFS='|' read -r sn _ tgt sstatus disk_pct <<< "$d"
[[ "$sn" == "$node" ]] || continue
disk_num=$(echo "$disk_pct" | tr -d '%')


Assumes df -h always returns a percentage like 95%. On unusual filesystems or errors, disk_num could be empty or non-numeric; the [[ -n ... -ge 95 ]] guard helps, but a clearer "unable to determine disk usage" branch would be better.
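A sketch of the explicit branch, wrapping the parse from the snippet in a hypothetical `check_disk` function (the 95% threshold and message wording follow the review comment):

```shell
# Hypothetical check_disk: validate the parsed percentage before comparing,
# and surface an explicit warning when it cannot be determined.
check_disk() {
  local disk_pct="$1" disk_num
  disk_num=$(echo "$disk_pct" | tr -d '%')
  if [[ ! "$disk_num" =~ ^[0-9]+$ ]]; then
    echo "WARN: unable to determine disk usage (got: '$disk_pct')"
  elif (( disk_num >= 95 )); then
    echo "FAIL: disk ${disk_num}% full"
  else
    echo "OK: disk ${disk_num}% used"
  fi
}
```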


@ssuday ssuday left a comment


Left comments

@awsawsawsawsawsawsaws

We're conducting an internal review. Please do NOT merge this.

@krokoko
Contributor

krokoko commented May 8, 2026

Automated review output:

Scope: +17,143 / -6 lines — 6 new diagnostic skills, 8 shell scripts, minor fixes to 2 existing skills


Critical Issues (3 found)

┌─────┬──────────────────────────┬───────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ # │ Source │ File │ Issue │
├─────┼──────────────────────────┼───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │ code-reviewer, │ │ Duplicated IAM policy check block — the entire "Managed policies" validation block (ListAttachedRolePolicies + │
│ 1 │ comment-analyzer │ scripts/diagnose-cluster.sh │ AmazonSageMakerClusterInstanceRolePolicy + AmazonSSMManagedInstanceCore checks) is copy-pasted verbatim, │
│ │ │ │ producing duplicate output and redundant API calls │
├─────┼──────────────────────────┼───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 2 │ silent-failure-hunter │ scripts/perf-snapshot.sh │ Missing -e flag — uses set -uo pipefail instead of set -euo pipefail. All 7 other scripts include -e. Without │
│ │ │ │ it, failed commands don't abort execution, allowing cascading silent failures │
├─────┼──────────────────────────┼───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 3 │ silent-failure-hunter │ scripts/slurm-diagnose-fix.sh │ ssm_run() has no timeout and swallows all errors — unlike other scripts that use timeout 180, this function can │
│ │ │ │ hang indefinitely. Combined with 2>/dev/null and ` │
└─────┴──────────────────────────┴───────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


Important Issues (7 found)

┌─────┬───────────────────────┬─────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ # │ Source │ Location │ Issue │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 4 │ comment-analyzer │ hyperpod-cluster-debugger/SKILL.md │ Error Handling table references "SSM send-command throttled" but the script uses start-session (same │
│ │ │ │ issue in hyperpod-node-debugger/SKILL.md) │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 5 │ comment-analyzer │ hyperpod-performance-debugger/SKILL.md │ Claims --node-ids takes "space-separated instance IDs, not a JSON array" — contradicts all other skills │
│ │ │ │ which use JSON array format '[""]' │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 6 │ silent-failure-hunter │ scripts/slurm-diagnose-fix.sh │ No pagination on list-cluster-nodes — default page size is 10 nodes. On clusters with >10 nodes, │
│ │ │ │ controller node may not be found and --fix actions target wrong nodes │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 7 │ silent-failure-hunter │ scripts/check-efa-sg.sh │ Missing input validation on --cluster — unlike all other scripts that regex-validate cluster name/ARN, │
│ │ │ │ this one passes raw input to AWS API │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 8 │ silent-failure-hunter │ scripts/check-vpc-config.sh │ EC2 API calls fall back to empty JSON ('{"Subnets":[]}') without notification — permission denial │
│ │ │ │ manifests as false-positive [FAIL] for DNS support │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 9 │ comment-analyzer │ hyperpod-cluster-debugger/SKILL.md │ Claims "paginated via --no-paginate / NextToken up to 5000 nodes" but script actually uses --max-results │
│ │ │ Defaults │ 100 with 200 pages (20,000 nodes) │
├─────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │ │ │ Example shows --filters "Name=instance-type,Values=ml.p5.48xlarge" — EC2 API doesn't recognize the ml. │
│ 10 │ comment-analyzer │ capacity-planning.md │ prefix (should be p5.48xlarge). The script correctly strips this prefix, but the docs show the wrong │
│ │ │ │ value │
└─────┴───────────────────────┴─────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
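For the missing-pagination finding against list-cluster-nodes (item 6), the fix can be sketched as a NextToken loop. Flag names follow the `aws sagemaker list-cluster-nodes` CLI, but treat the exact output paths as an assumption to verify against the API reference; `list_all_nodes` is a hypothetical helper.

```shell
# Hypothetical list_all_nodes: page through all cluster nodes instead of
# relying on the default (small) page size, emitting one JSON object per node.
list_all_nodes() {
  local cluster="$1" region="$2" token="" page
  while :; do
    if [[ -n "$token" ]]; then
      page=$(aws sagemaker list-cluster-nodes --cluster-name "$cluster" \
        --region "$region" --max-results 100 --next-token "$token")
    else
      page=$(aws sagemaker list-cluster-nodes --cluster-name "$cluster" \
        --region "$region" --max-results 100)
    fi
    echo "$page" | jq -c '.ClusterNodeSummaries[]'
    token=$(echo "$page" | jq -r '.NextToken // empty')
    [[ -z "$token" ]] && break
  done
}
```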


Suggestions (5 found)

┌─────┬───────────────────────┬────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ # │ Source │ Location │ Issue │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 11  │ silent-failure-hunter │ All scripts                    │ python3 -c "..." 2>/dev/null                                                                                       │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 12  │ silent-failure-hunter │ All pagination loops           │                                                                                                                    │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 13 │ comment-analyzer │ hyperpod-mfu-debugger/SKILL.md │ References scripts/get-cluster-info.sh as if local, but these live in the hyperpod-ssm skill │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 14 │ silent-failure-hunter │ check-vpc-config.sh │ DNS check should handle "unknown" explicitly with a warn, not produce a [FAIL] │
├─────┼───────────────────────┼────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 15 │ code-reviewer │ diagnose-cluster.sh │ fetch_cluster_events_cd passes accumulated JSON via sys.argv — could hit ARG_MAX on pathological clusters (unlike │
│ │ │ │ fetch_all_cluster_nodes_cd which uses stdin) │
└─────┴───────────────────────┴────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


Strengths

  • Consistent architecture — All 6 skills faithfully follow the "scripts are read-only signal collectors, references hold remediation" separation of concerns
  • Excellent skill descriptions — keyword-rich with positive triggers and negative boundaries ("Do NOT use for X → use Y"), creating a coherent delegation network
  • Strong input validation — regex validation on cluster names, regions, instance IDs across most scripts
  • SSM retry logic — diagnose-cluster.sh, nccl-diagnose.sh, and triage-cluster.sh have exemplary retry logic that distinguishes transient vs. fatal errors
  • Project structure compliance — follows plugins/<plugin>/skills/<skill>/SKILL.md with references/ and scripts/ perfectly
  • Read-only guarantee upheld — verified that diagnostic scripts only call describe-, list-, get-* APIs; slurm-diagnose-fix.sh properly gates mutations behind --fix
  • Version bump is defensible — 1.1.0 → 1.1.1 for additive diagnostic skills (though 1.2.0 would also be appropriate)

Recommended Action

  1. Fix critical issues first — remove the duplicated block in diagnose-cluster.sh, add -e to perf-snapshot.sh, add timeout + error checking to slurm-diagnose-fix.sh's ssm_run()
  2. Address documentation inaccuracies — fix send-command → start-session in Error Handling tables, correct the pagination claim in Defaults, fix the ml. prefix in capacity-planning.md
  3. Add pagination to slurm-diagnose-fix.sh and input validation to check-efa-sg.sh for parity with other scripts
  4. Consider the medium-severity suggestions for improved user experience on large clusters
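The ssm_run() hardening in step 1 could look like the sketch below: bound the call with timeout and surface failures instead of discarding stderr. The function shape and arguments mirror the review's description, not the script verbatim, and the SSM invocation details are an assumption to check against the actual script.

```shell
# Hypothetical hardened ssm_run: 180s timeout (matching the other scripts)
# and an explicit error path instead of 2>/dev/null.
ssm_run() {
  local target="$1" cmd="$2" out
  if ! out=$(timeout 180 aws ssm start-session --target "$target" \
        --document-name AWS-StartNonInteractiveCommand \
        --parameters "command=[\"$cmd\"]" 2>&1); then
    echo "ERROR: ssm_run failed or timed out for $target: $out" >&2
    return 1
  fi
  printf '%s\n' "$out"
}
```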
