feat(sagemaker-ai): add HyperPod debugger skills#158
Conversation
There was a problem hiding this comment.
Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
c7905f0 to
9b1c98e
Compare
…ll issues Add six new diagnostic skills for SageMaker HyperPod: - hyperpod-cluster-debugger: cluster-wide lifecycle issues (EKS/Slurm) - hyperpod-mfu-debugger: MFU degradation triage - hyperpod-nccl: NCCL failure diagnosis - hyperpod-node-debugger: per-node health issues - hyperpod-performance-debugger: NCCL bandwidth, filesystem, GPU failures - hyperpod-slurm-debugger: Slurm node-state management Also fixes: - hyperpod-issue-report: simplify SSM troubleshooting message - hyperpod-version-checker: remove unused RED color var, replace IS_GPU flag with runtime nvidia-smi detection Bumps sagemaker-ai plugin version to 1.1.1.
9b1c98e to
ef84ee3
Compare
| IS_GPU=false | ||
| [[ "$INSTANCE_TYPE" =~ (^|\.)(trn|inf) ]] && IS_NEURON=true | ||
| [[ "$INSTANCE_TYPE" =~ (^|\.)(p[0-9]|g[0-9]) ]] && IS_GPU=true | ||
| # GPU detection is driven by `cmd_exists nvidia-smi` at each GPU section below — |
There was a problem hiding this comment.
Remove the 3-line comment explaining the deleted IS_GPU flag, the code it refers to no longer exists.
| done | ||
|
|
||
| CLUSTER="" | ||
| REGION="${AWS_DEFAULT_REGION:-us-east-1}" |
There was a problem hiding this comment.
diagnose-cluster.sh and slurm-diagnose-fix.sh silently default to us-east-1, nccl-diagnose.sh exits when region is unset. Align all three on the explicit-exit behavior to avoid running diagnostics against the wrong region.
| set -euo pipefail | ||
|
|
||
| # Prerequisite tools. jq is a hard requirement per skill-creator safety rules. | ||
| for cmd in aws jq python3; do |
There was a problem hiding this comment.
Each script re-implements the "check required tools exist" loop. Minor, but a shared require_tools aws jq python3 helper in the future common.sh would be cleaner.
| (hyperpod-slurm-debugger), or uneven-NCCL, FSx, or GPU-failure | ||
| triage (hyperpod-performance-debugger). | ||
| metadata: | ||
| version: "1.0.0" |
There was a problem hiding this comment.
All six new skills ship at "1.0.0". Fine for the first cut, but worth deciding upfront whether skill-level versions track the plugin version or evolve independently, so future bumps have a rule to follow.
| local slurm_node="$1" | ||
| local instance_id group ssm_target | ||
| instance_id=$(echo "$NODES_JSON" | jq -r --arg dns "$slurm_node" ' | ||
| .ClusterNodeSummaries[] | select(.PrivateDnsName|startswith($dns)) | .InstanceId' | head -1) |
There was a problem hiding this comment.
Mapping Slurm node name to instance via startswith on PrivateDnsName can match multiple nodes if names share a prefix (e.g. ip-10-0-1-1 vs ip-10-0-1-10). Consider exact match on the hostname component.
| [[ -z "$d" ]] && continue | ||
| IFS='|' read -r sn _ tgt sstatus disk_pct <<< "$d" | ||
| [[ "$sn" == "$node" ]] || continue | ||
| disk_num=$(echo "$disk_pct" | tr -d '%') |
There was a problem hiding this comment.
Assumes df -h always returns a percentage like 95%. On unusual filesystems or errors, disk_num could be empty or non-numeric, the [[ -n ... -ge 95 ]] guard helps, but a clearer "unable to determine disk usage" branch would be better.
|
We're conducting an internal review. Please do NOT merge this. |
|
Auomtated review output: Scope: +17,143 / -6 lines — 6 new diagnostic skills, 8 shell scripts, minor fixes to 2 existing skills Critical Issues (3 found) ┌─────┬──────────────────────────┬───────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ Important Issues (7 found) ┌─────┬───────────────────────┬─────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐ Suggestions (5 found) ┌─────┬───────────────────────┬────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ Strengths
Recommended Action
|
Summary
New Skills
hyperpod-cluster-debuggerhyperpod-mfu-debuggerhyperpod-ncclhyperpod-node-debuggerhyperpod-performance-debuggerhyperpod-slurm-debuggerTest plan
mise run buildpasses (lint:md, fmt:check, lint:manifests, lint:cross-refs, validate:refs, security)By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.