
test(ce-dispatch): add skill-creator-style eval pack + battle-test results #5

Draft

devin-ai-integration[bot] wants to merge 1 commit into mvp/ce-dispatch-beta-rewrite from mvp/ce-dispatch-evals

Conversation


devin-ai-integration[bot] commented May 5, 2026

Summary

Battle-tests the ce-dispatch single-unit sync MVP rewrite (PR #4) against Anthropic's skill-creator eval protocol, so we have quantitative pre-review evidence that the rewrite produces structurally correct, discriminating output.

Stacks on top of mvp/ce-dispatch-beta-rewrite so PR #4 stays rewrite-focused. Once #4 merges, this PR will retarget to main automatically.

What's added

| Path | Purpose |
| --- | --- |
| `plugins/compound-engineering/skills/ce-dispatch/evals/evals.json` | 4 prompts × 5–9 quantitative expectations covering Phase 0–3 happy-path dispatch and Phase 4 respond-loop branches (review PR, reply to agent comment, mark unit complete). Schema matches Anthropic's `references/schemas.md`. See the illustrative entry sketch below the table. |
| `plugins/compound-engineering/skills/ce-dispatch/evals/files/sample-multi-unit-plan.md` | Realistic multi-unit plan fixture for Eval 1 (rate-limit feature with three implementation units; dispatch targets U2). |
| `plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py` | Path-A runner that mirrors skill-creator's protocol via direct OpenRouter calls. Defaults to `anthropic/claude-opus-4.7` for both executor and grader. Accepts `--executor-model`, `--iteration`, `--eval-id`, `--runs`, `--skip-without-skill`, `--skip-grader`, `--dry-run`. |
| `plugins/compound-engineering/evals/ce-dispatch-workspace/iteration-1/` | Full pack run results (4 evals × with-skill + baseline). |
| `plugins/compound-engineering/evals/ce-dispatch-workspace/iteration-2/` | Refined Eval 2 assertion + re-run, demonstrating the iterate-on-eval-design loop the framework expects. |
| `plugins/compound-engineering/evals/ce-dispatch-workspace/REPORT.md` | Human-readable battle-test report (recommended starting point). |
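
For orientation, each `evals.json` entry pairs one prompt with a list of binary, checkable expectations. The sketch below is purely illustrative — the field names are guesses; the authoritative shape is skill-creator's `references/schemas.md`:

```json
{
  "id": "happy-path-single-unit-dispatch",
  "prompt": "Dispatch unit U2 of the attached plan as a single-unit sync run...",
  "files": ["files/sample-multi-unit-plan.md"],
  "expectations": [
    "Renders all six required prompt-template sections",
    "Scopes dispatched content to U2, not U1 or U3"
  ]
}
```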

Headline results (Opus 4.7, ~$1.50 in API costs total)

| Configuration | Iter-1 (24 expectations across 4 evals) | After Iter-2 assertion fix |
| --- | --- | --- |
| with_skill | 95% (23/24) | 100% (24/24) |
| without_skill baseline | 51% (12/24) | 56% (13/24) |
| delta | +44 pp | +44 pp |

Per-eval (iteration-1):

  • Eval 1 happy-path-single-unit-dispatch: 9/9 with-skill, 4/9 baseline (+56 pp)
  • Eval 2 phase-4-respond-review-pr: with-skill 4/5 → 5/5 after iter-2; baseline 3/5 → 4/5
  • Eval 3 phase-4-respond-reply-to-agent-comment: 5/5 with-skill, 3/5 baseline (+40 pp)
  • Eval 4 phase-4-respond-mark-unit-complete: 5/5 with-skill, 2/5 baseline (+60 pp)

Findings

Skill-level: no skill bugs surfaced. The skill correctly:

  • renders all six required prompt-template sections,
  • scopes content from the correct unit when picking U2 of a multi-unit plan,
  • surfaces exactly four Phase 4 options,
  • routes review work through /ce-code-review,
  • uses --state all consistently,
  • prefixes orchestrator replies with [orchestrator -> <agent-name>] <ISO 8601 UTC> (example below),
  • verifies the PR is MERGED before closing the issue, and
  • tells the user to archive the Conductor workspace manually.
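
For illustration, a reply following that prefix convention would start like this (agent name, timestamp, and body are invented):

```
[orchestrator -> unit-2-agent] 2026-05-05T17:42:00Z
Reviewed your PR. Two inline comments to address before merge.
```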

Eval-pack-level: one assertion was over-specified. Eval 2's assertion 2 prescribed gh pr diff <number> even when the agent delegated review to /ce-code-review (which fetches its own diff). The agent's behavior was correct; the assertion was too strict. Iteration 2 refined the assertion to allow either path; with-skill went 4/5 → 5/5, baseline went 3/5 → 4/5. Grader's eval_feedback after iter-2: "Expectations are clear and well-targeted at the common failure modes; the output cleanly satisfied all of them."
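
Concretely, the refinement was of this shape (wording paraphrased, not quoted from the committed assertion):

```text
iter-1: Runs `gh pr diff <number>` to inspect the PR's changes.
iter-2: Inspects the PR's changes, either via `gh pr diff <number>` directly
        or by delegating to /ce-code-review (which fetches its own diff).
```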

What this method does and does not test

Tests well: skill prose correctness (right structure, right order, right routing decisions when the skill is loaded), discrimination against baseline (+44 pp), and eval design (the grader's eval_feedback flags weak assertions).

Does not test: real tool execution, multi-turn comment-protocol roundtrips, or real gh issue create / gh pr view execution against GitHub. The runner is single-shot Chat Completions, so the agent describes commands rather than executing them; the contract test in PR #4 (63/63 passing) covers the loader/template-render path, and the rest remains the user's manual end-to-end test plan.
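
For context on that limitation, the runner's whole interaction per eval is one system+user request to OpenRouter's OpenAI-compatible Chat Completions endpoint. A minimal sketch of that call shape, assuming only the standard endpoint and response format (this is not the actual run_eval_pack.py; names are illustrative):

```python
import json
import os
import urllib.request

def single_shot(model: str, system_prompt: str, eval_prompt: str) -> str:
    """One system+user Chat Completions call; no tools, no follow-up turns."""
    body = json.dumps({
        "model": model,  # e.g. "anthropic/claude-opus-4.7", the pack's default
        "messages": [
            # with_skill: skill text in the system prompt; baseline: empty system
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": eval_prompt},
        ],
    }).encode()
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```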

Review & Testing Checklist for Human

Yellow risk — this is additive test infrastructure that doesn't change runtime behavior. The skill itself is unchanged. Two items worth checking:

  • Spot-check at least one grading.json (e.g., iteration-1/1-happy-path-single-unit-dispatch/with_skill/grading.json): does the grader's per-expectation evidence hold up against the corresponding outputs/output.md? If the grader is too lenient, the 95-100% pass rate would be misleading.
  • Skim evals/ce-dispatch-workspace/REPORT.md and confirm the four evals cover the surfaces you actually care about for this MVP. If anything important is missing (e.g., the comment-protocol "STOP and wait" semantics, or a dispatch into a worktree that doesn't exist yet), I can add evals.

To re-run the pack yourself (requires the OPENROUTER_API_KEY env var):

```bash
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py
```
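
A few useful flag combinations (the flags are the ones documented above; the --eval-id value mirrors the eval names, but the exact accepted format is a guess, and the model slug is a placeholder):

```bash
# Check pack wiring without spending API credits
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py --dry-run

# Re-run only Eval 2 three times, skipping the baseline arm
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py \
  --eval-id phase-4-respond-review-pr --runs 3 --skip-without-skill

# Swap in a different executor model (any OpenRouter slug)
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py \
  --executor-model <openrouter-model-slug>
```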

Notes

Link to Devin session: https://app.devin.ai/sessions/56b17768c2ef4657ba155e2435bf1548
Requested by: @shubness


test(ce-dispatch): add skill-creator-style eval pack + battle-test results

Battle-tests the single-unit sync MVP rewrite (PR #4) against Anthropic's
skill-creator eval protocol (https://github.com/anthropics/skills/tree/main/skills/skill-creator).

Adds:
- evals/evals.json: 4 prompts covering Phase 0-3 happy-path dispatch and
  Phase 4 respond-loop branches (review PR, reply to agent comment, mark
  unit complete). Each prompt has 5-9 quantitative expectations.
- evals/files/sample-multi-unit-plan.md: realistic multi-unit plan
  fixture used by Eval 1 (rate-limit feature with three implementation
  units; dispatch targets U2).
- evals/scripts/run_eval_pack.py: Path-A runner that mirrors
  skill-creator's protocol via direct OpenRouter calls (system+user with
  loaded skill vs. baseline; JSON-graded against expectations; aggregated
  into benchmark.{json,md}). Defaults to claude-opus-4.7 (executor and
  grader); accepts --executor-model for substitution.
- evals/ce-dispatch-workspace/: iteration-1 (full pack) and iteration-2
  (refined eval-2 assertion) run results. Analyst-grade artifacts
  (grading.json, benchmark.json, transcript.md, eval_metadata.json,
  timing.json, metrics.json, output.md) are committed; raw transcript
  dumps are gitignored as regenerable.
- evals/ce-dispatch-workspace/REPORT.md: human-readable battle-test
  report.

Headline results (Opus 4.7, ~$1.50 in API costs total):
  with_skill:    24/24 expectations pass after iteration-2 (95% → 100%)
  without_skill: ~12/24 baseline (51-56%)
  delta:         +44 to +49 percentage points across 4 prompts

No skill changes required — eval surfaced one over-prescriptive
assertion (caught by the grader's eval_feedback) but no skill bugs.

bun test + bun run release:validate remain green (1307/1308 pass; 1 fail
is the pre-existing resolve-base.sh detached-shallow env failure).
devin-ai-integration[bot] commented

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


devin-ai-integration[bot] left a comment


Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.


Comment on lines +397 to +400

```python
if grader_text.startswith("```"):
    grader_text = grader_text.split("\n", 1)[1]
if grader_text.endswith("```"):
    grader_text = grader_text.rsplit("```", 1)[0]
```

🟡 Unhandled IndexError in grader response fence-stripping when response has no newline after opening backticks

When the grader model returns a response that starts with ``` but contains no newline character (e.g., just ``` or ```json), grader_text.split("\n", 1) returns a single-element list, and [1] raises an IndexError. This crashes the script instead of falling through to the json.JSONDecodeError handler below. While unlikely in practice (LLM responses almost always include a newline after an opening fence), this is a gap in the defensive parsing logic that the code explicitly intends to handle.

Suggested change

```diff
 if grader_text.startswith("```"):
-    grader_text = grader_text.split("\n", 1)[1]
+    parts = grader_text.split("\n", 1)
+    if len(parts) > 1:
+        grader_text = parts[1]
 if grader_text.endswith("```"):
     grader_text = grader_text.rsplit("```", 1)[0]
```