Add Claude Code automation layer for MAD (commands, agents, workflows) #160
Draft
coketaste wants to merge 6 commits into
Add a foundation pack so common MAD tasks (benchmarking, adding models, tuning, development) can be driven through Claude Code with the repo's conventions baked in:
- CLAUDE.md: models.json schema, 4-step add-model flow, the "performance: <value> <unit>" stdout contract, madengine v2.1.0 CLI commands, deployment inference, and profiling.
- .claude/agents/: mad-model-author, mad-perf-analyst, mad-benchmark-runner, mad-tuner.
- .claude/commands/: mad-add-model, mad-benchmark, mad-profile, mad-report, mad-tune.
- .claude/workflows/: mad-benchmark-sweep and mad-tune-search dynamic workflows (plan-only by default; execute:true on a GPU host).
- .claude/settings.json: shared read-only/common command allowlist.

Commands and docs verified against the installed madengine Typer CLI v2.1.0 (@main): top-level build/run/discover/report/database.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
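The "performance: <value> <unit>" stdout contract named in this commit can be illustrated with a small parser. This is a minimal sketch, not madengine's own parsing code: the regular expression, the function name `parse_performance`, and the choice to keep the last matching line are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the contract line: "performance: <value> <unit>".
# The exact grammar madengine accepts is not stated in this PR.
PERF_LINE = re.compile(
    r"^\s*performance:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(\S+)\s*$"
)


def parse_performance(stdout: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) from the last contract line, or None if absent."""
    result = None
    for line in stdout.splitlines():
        match = PERF_LINE.match(line)
        if match:
            # Assumption: if several lines match, the last one wins.
            result = (float(match.group(1)), match.group(2))
    return result
```

Returning `None` rather than raising lets a caller mark a run as unresolved when the script never printed the line.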
- mad-benchmark-sweep: drop the non-functional precision axis (precision is fixed per-model via training_precision or baked into the inference image, not a runtime flag) and give each cell its own perf_<cell>.csv to avoid clobbering a shared perf.csv under parallel execution; flag unresolved/errored cells.
- mad-model-author: document the multiple_results output contract alongside the performance: stdout line, and confirm new entries resolve via discover.
- Add /mad-validate: GPU-free static checker (JSON, paths, Dockerfile CONTEXT header, output contract) with errors vs convention-warning severities.
- Point profiling agent/command at scripts/common/tools.json as the source of truth and document the deploy-key convention.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Pull request overview
Adds a Claude Code automation layer for MAD so common benchmarking, profiling, reporting, tuning, validation, and model-authoring tasks can be driven through slash commands, specialized agents, and two workflow scripts.
Changes:
- Adds Claude Code command prompts, agent definitions, workflow scripts, and permission settings under .claude/.
- Adds repository guidance in CLAUDE.md for MAD conventions and madengine usage.
- Adds a standalone HTML how-to guide documenting the automation layer.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| .claude/agents/mad-benchmark-runner.md | Defines benchmark/profile command assembly and execution behavior. |
| .claude/agents/mad-model-author.md | Defines model scaffolding guidance for new MAD entries. |
| .claude/agents/mad-perf-analyst.md | Defines read-only benchmark result analysis behavior. |
| .claude/agents/mad-tuner.md | Defines iterative tuning behavior. |
| .claude/commands/mad-add-model.md | Adds slash command prompt for adding models. |
| .claude/commands/mad-benchmark.md | Adds slash command prompt for benchmark runs. |
| .claude/commands/mad-profile.md | Adds slash command prompt for profiled benchmark runs. |
| .claude/commands/mad-report.md | Adds slash command prompt for result analysis. |
| .claude/commands/mad-tune.md | Adds slash command prompt for tuning. |
| .claude/commands/mad-validate.md | Adds static validation command for MAD model entries. |
| .claude/settings.json | Adds Claude Code tool/command permission allow-list. |
| .claude/workflows/mad-benchmark-sweep.js | Adds parallel benchmark sweep workflow. |
| .claude/workflows/mad-tune-search.js | Adds candidate-based tuning search workflow. |
| CLAUDE.md | Adds Claude Code repository guidance for MAD. |
| mad-automation-howto.html | Adds user-facing HTML guide for commands, agents, workflows, and tips. |
    const safe = label.replace(/[^A-Za-z0-9._-]/g, '_')
    const outFile = `perf_${safe}.csv`
    const flags = [
      '--tags ' + cell.tag,
    const cmd = `madengine run ${flags}${ctx}`

    const action = execute
      ? `If AMD GPUs are present (check rocm-smi/amd-smi), RUN this command and parse the "performance: <value> <unit>" line from output. Results land in ${outFile}. If no GPUs, set status "skipped".`
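The per-cell output naming in the hunk above can be mirrored in Python to see how it behaves. This is a sketch for illustration only: the function name `perf_csv_name` and the sample labels are hypothetical, and only the character class and the `perf_<safe>.csv` pattern are taken from the workflow code.

```python
import re


def perf_csv_name(label: str) -> str:
    """Map a sweep-cell label to its own CSV name, as the workflow hunk does."""
    # Same character class as the JavaScript: anything outside
    # A-Z, a-z, 0-9, '.', '_' and '-' becomes an underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", label)
    return f"perf_{safe}.csv"
```

One property worth noting is that the mapping is not injective: the hypothetical labels `a/b` and `a_b` both yield `perf_a_b.csv`, so a caller that wants strict isolation between parallel cells may want to check the generated names for duplicates before launching.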
Comment on lines +56 to +58

    const ctxParts = []
    if (cell.nGpus) ctxParts.push(`"n_gpus": "${cell.nGpus}"`)
    const ctx = ctxParts.length ? ` --additional-context '{${ctxParts.join(', ')}}'` : ''
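The lines above assemble the `--additional-context` JSON by string concatenation. One alternative, sketched here in Python under stated assumptions, is to serialize the context with a JSON library and shell-quote each argument. The function name `build_run_command` and the sample tag are hypothetical; the `--tags` and `--additional-context` flags and the string-valued `n_gpus` key are the ones that appear in this PR.

```python
import json
import shlex
from typing import Optional


def build_run_command(tag: str, n_gpus: Optional[int] = None) -> str:
    """Assemble a madengine run command line with a quoted JSON context."""
    parts = ["madengine", "run", "--tags", tag]
    context = {}
    if n_gpus:
        # The workflow hunk passes n_gpus as a string, so this sketch does too.
        context["n_gpus"] = str(n_gpus)
    if context:
        # json.dumps handles escaping inside the JSON payload.
        parts += ["--additional-context", json.dumps(context)]
    # shlex.quote protects each argument from shell interpretation.
    return " ".join(shlex.quote(part) for part in parts)
```

For example, `build_run_command("some_tag", 8)` produces `madengine run --tags some_tag --additional-context '{"n_gpus": "8"}'`. Serializing first and quoting second keeps a value containing quotes or spaces from breaking either the JSON or the shell command.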
    const nCandidates = Math.max(1, Math.min(cfg.candidates || 4, 8))
    const execute = cfg.execute === true

    phase('Baseline')
    candidates,
    (cand) => {
      const action = execute
        ? `If AMD GPUs are present, apply the change, run "madengine run --tags ${tag} --live-output", parse the performance line, then REVERT the change. If no GPUs, status "skipped".`
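The apply, run, revert sequence in the prompt above depends on the revert happening even when a run fails. A minimal sketch of that guarantee for a single config file follows, assuming the change is a whole-file rewrite; the helper name `temporary_edit` and the `.bak` backup suffix are hypothetical, and the sketch does not invoke madengine itself.

```python
import contextlib
import shutil
from pathlib import Path


@contextlib.contextmanager
def temporary_edit(path: Path, new_text: str):
    """Write new_text to path for the duration of the block, then restore it."""
    backup = path.with_suffix(path.suffix + ".bak")
    shutil.copy2(path, backup)  # keep the original content and metadata
    try:
        path.write_text(new_text)
        yield path
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        shutil.move(str(backup), str(path))
```

The candidate run would go inside the `with temporary_edit(...)` block. The `finally` clause only covers exceptions and normal exits of the Python process; it does not help if the process itself is killed, so a bound on each run's duration is still needed.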
    if sp:
        sh_files = [sp] if sp.endswith(".sh") else glob.glob(os.path.join(sp,"**","*.sh"), recursive=True)
        for sh in sh_files:
            if os.path.isfile(sh) and "performance:" in open(sh, errors="ignore").read():
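The output-contract check in the hunk above can be written as a standalone function that closes each file after reading it. This is a sketch under assumptions rather than the checker shipped in this PR: the function name `scripts_with_perf_line` and its return shape (a sorted list of matching paths) are hypothetical.

```python
import glob
import os
from typing import List


def scripts_with_perf_line(sp: str) -> List[str]:
    """List .sh files at or under sp whose text contains 'performance:'."""
    if sp.endswith(".sh"):
        candidates = [sp]
    else:
        # "**" with recursive=True also matches .sh files directly under sp.
        candidates = glob.glob(os.path.join(sp, "**", "*.sh"), recursive=True)
    hits = []
    for sh in candidates:
        if not os.path.isfile(sh):
            continue
        # The context manager closes the handle; errors="ignore" skips
        # undecodable bytes, as in the original hunk.
        with open(sh, errors="ignore") as fh:
            if "performance:" in fh.read():
                hits.append(sh)
    return sorted(hits)
```

An empty result for a model's script path would then correspond to a missing output contract. A substring match like this also accepts a `performance:` that appears only in a comment, so it is a coarse check.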
Comment on lines +125 to +130

    ## Future (skeleton, not yet wired)

    `reference_db/mad_agent.db` (tables: `model_baselines`, `optimization_history`,
    `best_configurations`, `learned_patterns`) and `knowledge_base/` exist as an
    empty scaffold for a future persistent optimization-memory layer. Not populated
    yet — do not assume data is present.
Comment on lines +36 to +38

    - Smoke-test wiring with a single small tag before large sweeps. (There is no
      `dummy` model in this repo's `models.json` — confirm a real tag with
      `madengine discover`.)
Comment on lines +4 to +10

    "Bash(madengine discover *)",
    "Bash(python3 -m json.tool *)",
    "Bash(git status)",
    "Bash(git diff *)",
    "Bash(git log *)",
    "Bash(rocm-smi *)",
    "Bash(amd-smi *)",
…g & tuning

Brings Claude Code's multi-agent automation to MAD: plain slash commands now drive the full benchmark/tune lifecycle, with dynamic workflows that fan out specialized subagents and synthesize their results - turning manual, flag-heavy madengine runs into one-line, self-orchestrating operations.

Headline capabilities:
- mad-benchmark-sweep: a dynamic workflow that benchmarks many models in parallel and auto-builds a comparison table, with isolated per-cell output so parallel runs never clobber each other.
- mad-tune-search: an agentic, profiling-driven tuning loop. It profiles once to DIAGNOSE the real bottleneck (compute/memory/communication/launch), proposes evidence-backed candidates, measures each on clean runs, and has an independent agent adversarially verify every claimed gain before recommending a config - decisions grounded in data, not guesswork.

Engineering changes that make this work:
- Both workflows now accept the CLI-style flags the slash commands actually pass (--tags / --additional-context / --plan); the prior object-only parsing silently dropped tags and context and defaulted to plan-only. They thread --additional-context into every run and execute by default.
- mad-tune-search splits one context into a profiled variant (Diagnose) and a clean variant (measurement), evaluates candidates sequentially to avoid GPU contention and config-edit races, and isolates output to perf_tune_<id>.csv.
- mad-automation-howto.html updated to match: CLI-flag tables, execute-by-default callouts, the 5-phase tune-search, and a bottleneck-to-lever reference table.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add real results and operational lessons from the Qwen3-8B profiling and tuning session on MI350X:
- TunableOp warmup warning: first cold run collapses throughput to ~0.2-1 tok/s during online GEMM benchmarking; measurements are only valid on warm subsequent runs. Added to env var card and Tips section.
- New red callout in tune-search: Docker containers outlive the Claude session, so a slow candidate can stall the whole sequential search and leave a config edit (e.g. extended.yaml) unapplied. Always pass --timeout to bound each candidate run.
- New Tips entry "Qwen3-8B live tuning findings": rocm_trace_lite kernel breakdown (Cijk GEMM 27% + wvSplitK 21% = compute-bound), and the headline result -- max_concurrency=32 delivered 7993 tok/s throughput vs 422 tok/s baseline (18.9x), confirming concurrency as the primary vLLM serving lever.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Correct the architecture SVG so all 6 slash commands, both workflows, and the inline /mad-validate path are represented accurately. Give /mad-profile its own box, route /mad-validate to an inline-script node, split the two workflows with distinct fan-out targets, and color-code arrows with a legend.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Summary