# test(ce-dispatch): add skill-creator-style eval pack + battle-test results #5
Battle-tests the single-unit sync MVP rewrite (PR #4) against Anthropic's skill-creator eval protocol (https://github.com/anthropics/skills/tree/main/skills/skill-creator).

Adds:

- `evals/evals.json`: 4 prompts covering Phase 0-3 happy-path dispatch and the Phase 4 respond-loop branches (review PR, reply to agent comment, mark unit complete). Each prompt has 5-9 quantitative expectations.
- `evals/files/sample-multi-unit-plan.md`: realistic multi-unit plan fixture used by Eval 1 (rate-limit feature with three implementation units; dispatch targets U2).
- `evals/scripts/run_eval_pack.py`: Path-A runner that mirrors skill-creator's protocol via direct OpenRouter calls (system + user with the loaded skill vs. a baseline; JSON-graded against expectations; aggregated into `benchmark.{json,md}`). Defaults to claude-opus-4.7 for both executor and grader; accepts `--executor-model` for substitution.
- `evals/ce-dispatch-workspace/`: iteration-1 (full pack) and iteration-2 (refined eval-2 assertion) run results. Analyst-grade artifacts (`grading.json`, `benchmark.json`, `transcript.md`, `eval_metadata.json`, `timing.json`, `metrics.json`, `output.md`) are committed; raw transcript dumps are gitignored as regenerable.
- `evals/ce-dispatch-workspace/REPORT.md`: human-readable battle-test report.

Headline results (Opus 4.7, ~$1.50 in total API costs):

- `with_skill`: 24/24 expectations pass after iteration-2 (95% → 100%)
- `without_skill`: ~12/24 baseline (51-56%)
- delta: +44 to +49 percentage points across 4 prompts

No skill changes required: the eval surfaced one over-prescriptive assertion (caught by the grader's `eval_feedback`) but no skill bugs. `bun test` and `bun run release:validate` remain green (1307/1308 pass; the 1 failure is the pre-existing `resolve-base.sh` detached-shallow env failure).
> 🤖 **Devin AI Engineer**: I'll be helping with this pull request! Note: I can only respond to comments from users who have write access to this repository.
The flagged lines in `run_eval_pack.py`:

```python
if grader_text.startswith("```"):
    grader_text = grader_text.split("\n", 1)[1]
if grader_text.endswith("```"):
    grader_text = grader_text.rsplit("```", 1)[0]
```
**🟡 Unhandled `IndexError` in grader-response fence-stripping when the response has no newline after the opening backticks**

When the grader model returns a response that starts with `` ``` `` but contains no newline character (e.g., just `` ``` `` or ```` ```json ````), `grader_text.split("\n", 1)` returns a single-element list, and `[1]` raises an `IndexError`. This crashes the script instead of falling through to the `json.JSONDecodeError` handler below. While unlikely in practice (LLM responses almost always include a newline after an opening fence), it is a gap in the defensive parsing that the code explicitly intends to provide.
Suggested change. Before:

```python
if grader_text.startswith("```"):
    grader_text = grader_text.split("\n", 1)[1]
if grader_text.endswith("```"):
    grader_text = grader_text.rsplit("```", 1)[0]
```

After:

```python
if grader_text.startswith("```"):
    parts = grader_text.split("\n", 1)
    if len(parts) > 1:
        grader_text = parts[1]
if grader_text.endswith("```"):
    grader_text = grader_text.rsplit("```", 1)[0]
```
## Summary

Battle-tests the `ce-dispatch` single-unit sync MVP rewrite (PR #4) against Anthropic's skill-creator eval protocol, so we have quantitative pre-review evidence that the rewrite produces structurally correct, discriminating output.

Stacks on top of `mvp/ce-dispatch-beta-rewrite` so PR #4 stays rewrite-focused. Once #4 merges, this PR will retarget to `main` automatically.

## What's added
- `plugins/compound-engineering/skills/ce-dispatch/evals/evals.json`: 4 prompts with 5-9 quantitative expectations each, grounded in `references/schemas.md`.
- `plugins/compound-engineering/skills/ce-dispatch/evals/files/sample-multi-unit-plan.md`: multi-unit plan fixture used by Eval 1.
- `plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py`: runner defaulting to `anthropic/claude-opus-4.7` for both executor and grader. Accepts `--executor-model`, `--iteration`, `--eval-id`, `--runs`, `--skip-without-skill`, `--skip-grader`, `--dry-run`.
- `plugins/compound-engineering/evals/ce-dispatch-workspace/iteration-1/` and `plugins/compound-engineering/evals/ce-dispatch-workspace/iteration-2/`: committed run results.
- `plugins/compound-engineering/evals/ce-dispatch-workspace/REPORT.md`: human-readable battle-test report.

## Headline results (Opus 4.7, ~$1.50 in API costs total)
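As a shape illustration only, one `evals.json` entry might look like the following; the field names (`id`, `prompt`, `files`, `expectations`) are assumptions inferred from the description above, not the actual schema:

```python
import json

# Hypothetical entry shape; the real schema lives in evals/evals.json.
eval_entry = {
    "id": "happy-path-single-unit-dispatch",
    "prompt": "Dispatch unit U2 of the attached multi-unit plan.",
    "files": ["files/sample-multi-unit-plan.md"],
    "expectations": [
        # 5-9 quantitative, individually gradable assertions per prompt:
        "Renders all six required prompt-template sections",
        "Scopes content from unit U2 only",
    ],
}
print(json.dumps(eval_entry, indent=2))
```

Each expectation is graded independently, which is what makes the pass counts below (e.g. 9/9, 4/9) meaningful.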
- `with_skill`: 24/24 expectations pass after iteration-2 (95% → 100%)
- `without_skill`: ~12/24 baseline (51-56%)
- delta: +44 to +49 percentage points across 4 prompts

Per-eval (iteration-1):
- `happy-path-single-unit-dispatch`: 9/9 with-skill, 4/9 baseline (+56 pp)
- `phase-4-respond-review-pr`: 4/5 → 5/5 in iter-2 with-skill, 3/5 → 4/5 baseline
- `phase-4-respond-reply-to-agent-comment`: 5/5 with-skill, 3/5 baseline (+40 pp)
- `phase-4-respond-mark-unit-complete`: 5/5 with-skill, 2/5 baseline (+60 pp)

## Findings
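The quoted deltas are plain percentage-point arithmetic and can be recomputed directly:

```python
# Per-eval delta: with-skill pass rate minus baseline pass rate, in percentage points.
def delta_pp(with_pass: int, total: int, baseline_pass: int) -> int:
    return round(100 * (with_pass - baseline_pass) / total)

print(delta_pp(9, 9, 4))  # happy-path-single-unit-dispatch -> 56
print(delta_pp(5, 5, 3))  # phase-4-respond-reply-to-agent-comment -> 40
print(delta_pp(5, 5, 2))  # phase-4-respond-mark-unit-complete -> 60
```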
**Skill-level:** no skill bugs surfaced. The skill correctly renders all six required prompt-template sections, scopes content from the correct unit when picking U2 of a multi-unit plan, surfaces exactly four Phase 4 options, routes review work through `/ce-code-review`, uses `--state all` consistently, prefixes orchestrator replies with `[orchestrator -> <agent-name>] <ISO 8601 UTC>`, verifies the PR is `MERGED` before closing the issue, and tells the user to archive the Conductor workspace manually.

**Eval-pack-level:** one assertion was over-specified. Eval 2's assertion 2 prescribed `gh pr diff <number>` even when the agent delegated review to `/ce-code-review` (which fetches its own diff). The agent's behavior was correct; the assertion was too strict. Iteration 2 refined the assertion to allow either path; with-skill went 4/5 → 5/5, baseline went 3/5 → 4/5. The grader's `eval_feedback` after iter-2: "Expectations are clear and well-targeted at the common failure modes; the output cleanly satisfied all of them."

## What this method does and does not test
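The reply-prefix convention the skill enforces can be illustrated with a small sketch; `make_prefix` is a hypothetical helper (the skill prescribes only the format, not this code):

```python
from datetime import datetime, timezone

def make_prefix(agent_name: str, now: datetime) -> str:
    # Format: [orchestrator -> <agent-name>] <ISO 8601 UTC>
    ts = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"[orchestrator -> {agent_name}] {ts}"

stamp = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(make_prefix("devin", stamp))  # [orchestrator -> devin] 2025-01-02T03:04:05Z
```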
**Tests well:** skill prose correctness (right structure, right order, right routing decisions when the skill is loaded), discrimination against baseline (+44 pp), and eval design (the grader's `eval_feedback` flags weak assertions).

**Does not test:** real tool execution (the runner is single-shot Chat Completions; the agent describes commands rather than executing them; the contract test in PR #4, 63/63 passing, covers the loader/template-render path), multi-turn comment-protocol roundtrip, or real `gh issue create`/`gh pr view` execution against GitHub. Those remain the user's manual end-to-end test plan.

## Review & Testing Checklist for Human
Yellow risk: this is additive test infrastructure that doesn't change runtime behavior. The skill itself is unchanged. Two items worth checking:

- Spot-check a `grading.json` (e.g., `iteration-1/1-happy-path-single-unit-dispatch/with_skill/grading.json`): does the grader's per-expectation evidence look genuinely satisfied by the corresponding `outputs/output.md`? If the grader is too lenient, the 95-100% pass rate would be misleading.
- Read `evals/ce-dispatch-workspace/REPORT.md` and confirm the four evals cover the surfaces you actually care about for this MVP. If anything important is missing (e.g., the comment-protocol "STOP and wait" semantics, or a dispatch into a worktree that doesn't exist yet), I can add evals.

To re-run the pack yourself (requires the `OPENROUTER_API_KEY` env var):
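The exact invocation was not captured above, so the following is a hedged sketch: the runner path and flags come from the option list in this PR, and the guard keeps it harmless outside a repo checkout:

```shell
RUNNER=plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py

if [ -f "$RUNNER" ] && [ -n "${OPENROUTER_API_KEY:-}" ]; then
  # Full pack (with-skill + baseline + grader), first iteration:
  python "$RUNNER" --iteration 1
  # Re-run a single eval with a cheaper executor, skipping the baseline arm:
  python "$RUNNER" --iteration 2 --eval-id phase-4-respond-review-pr \
    --executor-model anthropic/claude-sonnet-4.5 --skip-without-skill
else
  echo "skipping: need a repo checkout and OPENROUTER_API_KEY"
fi
```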
## Notes

- Executor-model substitution: `--executor-model anthropic/claude-sonnet-4.5` or any OpenRouter-resolvable Anthropic model.
- Raw transcript dumps (`transcript-raw.json`, `*.stream.jsonl`) are gitignored as regenerable; analyst-grade artifacts (`grading.json`, `benchmark.json`, `transcript.md`, `eval_metadata.json`, `timing.json`, `metrics.json`, `output.md`) are committed as battle-test evidence.
- The workspace lives at `plugins/compound-engineering/evals/ce-dispatch-workspace/` (sibling of `skills/`) rather than inside `skills/` so the `ce-` prefix scanner doesn't treat it as a malformed skill.
- `bun test`: 1307/1308 pass (the 1 failure is the pre-existing `resolve-base.sh` detached-shallow env failure, the same baseline as PRs #1, #2, #3, and #4).
- `bun run release:validate`: clean.

Link to Devin session: https://app.devin.ai/sessions/56b17768c2ef4657ba155e2435bf1548
Requested by: @shubness