Skip to content

Add Claude Code automation layer for MAD (commands, agents, workflows)#160

Draft
coketaste wants to merge 6 commits into
ROCm:developfrom
coketaste:coketaste/claude-workflow
Draft

Add Claude Code automation layer for MAD (commands, agents, workflows)#160
coketaste wants to merge 6 commits into
ROCm:developfrom
coketaste:coketaste/claude-workflow

Conversation

@coketaste
Copy link
Copy Markdown
Contributor

Summary

Adds a Claude Code automation layer on top of `madengine` so common MAD
tasks — benchmarking, adding models, profiling, tuning, and analysis — can be
driven by slash commands and multi-agent workflows instead of hand-assembled
CLI invocations.

- **Slash commands + agents**: `/mad-benchmark`, `/mad-add-model`, `/mad-profile`,
  `/mad-report`, `/mad-tune`, `/mad-validate`, backed by dedicated subagents
  (benchmark-runner, model-author, perf-analyst, tuner).
- **Workflows**: `mad-benchmark-sweep` (fan out a benchmark matrix across tags,
  parallel, synthesize a comparison table) and `mad-tune-search` (profile-driven
  tuning search with adversarial verification).
- **Static validator** for `models.json` entries (JSON, paths, Dockerfile header,
  performance-line contract) — no GPU required.
- **How-to docs**: `mad-automation-howto.html`.

## Key design decisions

- **Workflows accept the same CLI-style flags as `/mad-benchmark`**
  (`--tags`, `--additional-context`, `--ngpus`, `--timeout`, `--plan`), and thread
  `--additional-context` (HF token, `docker_env_vars`) into every cell. They
  execute by default; `--plan`/`--no-execute` for a GPU-free dry run.
- **`mad-benchmark-sweep` runs cells in parallel** with per-cell `perf_<tag>.csv`
  isolation (no shared-file clobbering); precision is intentionally not a sweep
  axis (it's fixed per model in `models.json`).
- **`mad-tune-search` is profile-driven**: a Diagnose phase profiles the model
  once to classify the bottleneck (compute/memory/communication/launch) from
  kernel hotspots, proposals must cite that evidence, and candidates are then
  measured **cleanly and sequentially** (profiler stripped) so deltas aren't
  buried in tracing overhead or corrupted by GPU contention / parallel file edits.
- Both workflows handle `multiple_results` models (per-metric CSV instead of a
  stdout `performance:` line).

## Test plan

- [ ] `madengine discover --tags <tag>` resolves for the documented examples
- [ ] `/mad-benchmark-sweep --tags ... --plan` validates without a GPU
- [ ] `/mad-benchmark-sweep --tags <tag> --additional-context '{...}'` executes
      on a GPU host and writes `perf_<tag>.csv`
- [ ] `/mad-tune-search --tag <model> --plan` produces a profiling-informed
      candidate list
- [ ] `/mad-tune-search --tag <model> --additional-context '{...}'` diagnoses
      the bottleneck and measures candidates cleanly
- [ ] `/mad-validate` passes for all `models.json` entries
- [ ] `mad-automation-howto.html` renders and matches current workflow behavior

coketaste and others added 3 commits May 31, 2026 12:51
Add a foundation pack so common MAD tasks (benchmarking, adding models,
tuning, development) can be driven through Claude Code with the repo's
conventions baked in:

- CLAUDE.md: models.json schema, 4-step add-model flow, the
  "performance: <value> <unit>" stdout contract, madengine v2.1.0 CLI
  commands, deployment inference, and profiling.
- .claude/agents/: mad-model-author, mad-perf-analyst, mad-benchmark-runner,
  mad-tuner.
- .claude/commands/: mad-add-model, mad-benchmark, mad-profile, mad-report,
  mad-tune.
- .claude/workflows/: mad-benchmark-sweep and mad-tune-search dynamic
  workflows (plan-only by default; execute:true on a GPU host).
- .claude/settings.json: shared read-only/common command allowlist.

Commands and docs verified against the installed madengine Typer CLI v2.1.0
(@main): top-level build/run/discover/report/database.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- mad-benchmark-sweep: drop the non-functional precision axis (precision is
  fixed per-model via training_precision or baked into the inference image, not
  a runtime flag) and give each cell its own perf_<cell>.csv to avoid clobbering
  a shared perf.csv under parallel execution; flag unresolved/errored cells.
- mad-model-author: document the multiple_results output contract alongside the
  performance: stdout line, and confirm new entries resolve via discover.
- Add /mad-validate: GPU-free static checker (JSON, paths, Dockerfile CONTEXT
  header, output contract) with errors vs convention-warning severities.
- Point profiling agent/command at scripts/common/tools.json as the source of
  truth and document the deploy-key convention.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@coketaste coketaste self-assigned this Jun 1, 2026
Copilot AI review requested due to automatic review settings June 1, 2026 02:43
@coketaste coketaste marked this pull request as draft June 1, 2026 02:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Claude Code automation layer for MAD so common benchmarking, profiling, reporting, tuning, validation, and model-authoring tasks can be driven through slash commands, specialized agents, and two workflow scripts.

Changes:

  • Adds Claude Code command prompts, agent definitions, workflow scripts, and permission settings under .claude/.
  • Adds repository guidance in CLAUDE.md for MAD conventions and madengine usage.
  • Adds a standalone HTML how-to guide documenting the automation layer.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
.claude/agents/mad-benchmark-runner.md Defines benchmark/profile command assembly and execution behavior.
.claude/agents/mad-model-author.md Defines model scaffolding guidance for new MAD entries.
.claude/agents/mad-perf-analyst.md Defines read-only benchmark result analysis behavior.
.claude/agents/mad-tuner.md Defines iterative tuning behavior.
.claude/commands/mad-add-model.md Adds slash command prompt for adding models.
.claude/commands/mad-benchmark.md Adds slash command prompt for benchmark runs.
.claude/commands/mad-profile.md Adds slash command prompt for profiled benchmark runs.
.claude/commands/mad-report.md Adds slash command prompt for result analysis.
.claude/commands/mad-tune.md Adds slash command prompt for tuning.
.claude/commands/mad-validate.md Adds static validation command for MAD model entries.
.claude/settings.json Adds Claude Code tool/command permission allow-list.
.claude/workflows/mad-benchmark-sweep.js Adds parallel benchmark sweep workflow.
.claude/workflows/mad-tune-search.js Adds candidate-based tuning search workflow.
CLAUDE.md Adds Claude Code repository guidance for MAD.
mad-automation-howto.html Adds user-facing HTML guide for commands, agents, workflows, and tips.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

const safe = label.replace(/[^A-Za-z0-9._-]/g, '_')
const outFile = `perf_${safe}.csv`
const flags = [
'--tags ' + cell.tag,
const cmd = `madengine run ${flags}${ctx}`

const action = execute
? `If AMD GPUs are present (check rocm-smi/amd-smi), RUN this command and parse the "performance: <value> <unit>" line from output. Results land in ${outFile}. If no GPUs, set status "skipped".`
Comment on lines +56 to +58
const ctxParts = []
if (cell.nGpus) ctxParts.push(`"n_gpus": "${cell.nGpus}"`)
const ctx = ctxParts.length ? ` --additional-context '{${ctxParts.join(', ')}}'` : ''
const nCandidates = Math.max(1, Math.min(cfg.candidates || 4, 8))
const execute = cfg.execute === true

phase('Baseline')
Comment thread .claude/workflows/mad-tune-search.js Outdated
candidates,
(cand) => {
const action = execute
? `If AMD GPUs are present, apply the change, run "madengine run --tags ${tag} --live-output", parse the performance line, then REVERT the change. If no GPUs, status "skipped".`
if sp:
sh_files = [sp] if sp.endswith(".sh") else glob.glob(os.path.join(sp,"**","*.sh"), recursive=True)
for sh in sh_files:
if os.path.isfile(sh) and "performance:" in open(sh, errors="ignore").read():
Comment thread CLAUDE.md
Comment on lines +125 to +130
## Future (skeleton, not yet wired)

`reference_db/mad_agent.db` (tables: `model_baselines`, `optimization_history`,
`best_configurations`, `learned_patterns`) and `knowledge_base/` exist as an
empty scaffold for a future persistent optimization-memory layer. Not populated
yet — do not assume data is present.
Comment on lines +36 to +38
- Smoke-test wiring with a single small tag before large sweeps. (There is no
`dummy` model in this repo's `models.json` — confirm a real tag with
`madengine discover`.)
Comment thread .claude/settings.json
Comment on lines +4 to +10
"Bash(madengine discover *)",
"Bash(python3 -m json.tool *)",
"Bash(git status)",
"Bash(git diff *)",
"Bash(git log *)",
"Bash(rocm-smi *)",
"Bash(amd-smi *)",
coketaste and others added 3 commits June 1, 2026 03:43
…g & tuning

Brings Claude Code's multi-agent automation to MAD: plain slash commands now
drive the full benchmark/tune lifecycle, with dynamic workflows that fan out
specialized subagents and synthesize their results - turning manual, flag-heavy
madengine runs into one-line, self-orchestrating operations.

Headline capabilities:
- mad-benchmark-sweep: a dynamic workflow that benchmarks many models in
  parallel and auto-builds a comparison table, with isolated per-cell output
  so parallel runs never clobber each other.
- mad-tune-search: an agentic, profiling-driven tuning loop. It profiles once
  to DIAGNOSE the real bottleneck (compute/memory/communication/launch),
  proposes evidence-backed candidates, measures each on clean runs, and has an
  independent agent adversarially verify every claimed gain before recommending
  a config - decisions grounded in data, not guesswork.

Engineering changes that make this work:
- Both workflows now accept the CLI-style flags the slash commands actually
  pass (--tags / --additional-context / --plan); the prior object-only parsing
  silently dropped tags and context and defaulted to plan-only. They thread
  --additional-context into every run and execute by default.
- mad-tune-search splits one context into a profiled variant (Diagnose) and a
  clean variant (measurement), evaluates candidates sequentially to avoid GPU
  contention and config-edit races, and isolates output to perf_tune_<id>.csv.
- mad-automation-howto.html updated to match: CLI-flag tables, execute-by-default
  callouts, the 5-phase tune-search, and a bottleneck-to-lever reference table.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add real results and operational lessons from the Qwen3-8B profiling and
tuning session on MI350X:

- TunableOp warmup warning: first cold run collapses throughput to
  ~0.2-1 tok/s during online GEMM benchmarking; measurements are only
  valid on warm subsequent runs. Added to env var card and Tips section.
- New red callout in tune-search: Docker containers outlive the Claude
  session, so a slow candidate can stall the whole sequential search and
  leave a config edit (e.g. extended.yaml) unapplied. Always pass
  --timeout to bound each candidate run.
- New Tips entry "Qwen3-8B live tuning findings": rocm_trace_lite kernel
  breakdown (Cijk GEMM 27% + wvSplitK 21% = compute-bound), and the
  headline result -- max_concurrency=32 delivered 7993 tok/s throughput
  vs 422 tok/s baseline (18.9x), confirming concurrency as the primary
  vLLM serving lever.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Correct the architecture SVG so all 6 slash commands, both workflows, and
the inline /mad-validate path are represented accurately. Give /mad-profile
its own box, route /mad-validate to an inline-script node, split the two
workflows with distinct fan-out targets, and color-code arrows with a legend.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants