
test(ce-dispatch): add skill-creator-style eval pack + battle-test results #5

Draft

devin-ai-integration[bot] wants to merge 1 commit into mvp/ce-dispatch-beta-rewrite from mvp/ce-dispatch-evals

Conversation


devin-ai-integration[bot] commented May 5, 2026

Summary

Battle-tests the ce-dispatch single-unit sync MVP rewrite (PR #4) against Anthropic's skill-creator eval protocol, so we have quantitative pre-review evidence that the rewrite produces structurally correct, discriminating output.

Stacks on top of mvp/ce-dispatch-beta-rewrite so PR #4 stays rewrite-focused. Once #4 merges, this PR will retarget to main automatically.

What's added

| Path | Purpose |
| --- | --- |
| `plugins/compound-engineering/skills/ce-dispatch/evals/evals.json` | 4 prompts × 5–9 quantitative expectations covering Phase 0–3 happy-path dispatch and Phase 4 respond-loop branches (review PR, reply to agent comment, mark unit complete). Schema matches Anthropic's `references/schemas.md`. See the illustrative entry sketch below the table. |
| `plugins/compound-engineering/skills/ce-dispatch/evals/files/sample-multi-unit-plan.md` | Realistic multi-unit plan fixture for Eval 1 (rate-limit feature with three implementation units; dispatch targets U2). |
| `plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py` | Path-A runner that mirrors skill-creator's protocol via direct OpenRouter calls. Defaults to `anthropic/claude-opus-4.7` for both executor and grader. Accepts `--executor-model`, `--iteration`, `--eval-id`, `--runs`, `--skip-without-skill`, `--skip-grader`, `--dry-run`. |
| `plugins/compound-engineering/evals/ce-dispatch-workspace/iteration-1/` | Full pack run results (4 evals × with-skill + baseline). |
| `plugins/compound-engineering/evals/ce-dispatch-workspace/iteration-2/` | Refined Eval 2 assertion + re-run, demonstrating the iterate-on-eval-design loop the framework expects. |
| `plugins/compound-engineering/evals/ce-dispatch-workspace/REPORT.md` | Human-readable battle-test report (recommended starting point). |
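
For orientation, each `evals.json` entry pairs one prompt with a list of binary, checkable expectations. The sketch below is purely illustrative — the field names are guesses; the authoritative shape is skill-creator's `references/schemas.md`:

```json
{
  "id": "happy-path-single-unit-dispatch",
  "prompt": "Dispatch unit U2 of the attached plan as a single-unit sync run...",
  "files": ["files/sample-multi-unit-plan.md"],
  "expectations": [
    "Renders all six required prompt-template sections",
    "Scopes dispatched content to U2, not U1 or U3"
  ]
}
```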

Headline results (Opus 4.7, ~$1.50 in API costs total)

| Configuration | Iter-1 (24 expectations across 4 evals) | After Iter-2 assertion fix |
| --- | --- | --- |
| with_skill | 95% (23/24) | 100% (24/24) |
| without_skill baseline | 51% (12/24) | 56% (13/24) |
| delta | +44 pp | +44 pp |

Per-eval (iteration-1):

  • Eval 1 happy-path-single-unit-dispatch: 9/9 with-skill, 4/9 baseline (+56 pp)
  • Eval 2 phase-4-respond-review-pr: with-skill 4/5 → 5/5 after iter-2; baseline 3/5 → 4/5
  • Eval 3 phase-4-respond-reply-to-agent-comment: 5/5 with-skill, 3/5 baseline (+40 pp)
  • Eval 4 phase-4-respond-mark-unit-complete: 5/5 with-skill, 2/5 baseline (+60 pp)

Findings

Skill-level: no skill bugs surfaced. The skill correctly:

  • renders all six required prompt-template sections,
  • scopes content from the correct unit when picking U2 of a multi-unit plan,
  • surfaces exactly four Phase 4 options,
  • routes review work through /ce-code-review,
  • uses --state all consistently,
  • prefixes orchestrator replies with [orchestrator -> <agent-name>] <ISO 8601 UTC> (example below),
  • verifies the PR is MERGED before closing the issue, and
  • tells the user to archive the Conductor workspace manually.
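
For illustration, a reply following that prefix convention would start like this (agent name, timestamp, and body are invented):

```
[orchestrator -> unit-2-agent] 2026-05-05T17:42:00Z
Reviewed your PR. Two inline comments to address before merge.
```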

Eval-pack-level: one assertion was over-specified. Eval 2's assertion 2 prescribed gh pr diff <number> even when the agent delegated review to /ce-code-review (which fetches its own diff). The agent's behavior was correct; the assertion was too strict. Iteration 2 refined the assertion to allow either path; with-skill went 4/5 → 5/5, baseline went 3/5 → 4/5. Grader's eval_feedback after iter-2: "Expectations are clear and well-targeted at the common failure modes; the output cleanly satisfied all of them."
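
Concretely, the refinement was of this shape (wording paraphrased, not quoted from the committed assertion):

```text
iter-1: Runs `gh pr diff <number>` to inspect the PR's changes.
iter-2: Inspects the PR's changes, either via `gh pr diff <number>` directly
        or by delegating to /ce-code-review (which fetches its own diff).
```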

What this method does and does not test

Tests well: skill prose correctness (right structure, right order, right routing decisions when the skill is loaded), discrimination against baseline (+44 pp), and eval design (the grader's eval_feedback flags weak assertions).

Does not test: real tool execution, multi-turn comment-protocol roundtrips, or real gh issue create / gh pr view execution against GitHub. The runner is single-shot Chat Completions, so the agent describes commands rather than executing them; the contract test in PR #4 (63/63 passing) covers the loader/template-render path, and the rest remains the user's manual end-to-end test plan.
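
For context on that limitation, the runner's whole interaction per eval is one system+user request to OpenRouter's OpenAI-compatible Chat Completions endpoint. A minimal sketch of that call shape, assuming only the standard endpoint and response format (this is not the actual run_eval_pack.py; names are illustrative):

```python
import json
import os
import urllib.request

def single_shot(model: str, system_prompt: str, eval_prompt: str) -> str:
    """One system+user Chat Completions call; no tools, no follow-up turns."""
    body = json.dumps({
        "model": model,  # e.g. "anthropic/claude-opus-4.7", the pack's default
        "messages": [
            # with_skill: skill text in the system prompt; baseline: empty system
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": eval_prompt},
        ],
    }).encode()
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```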

Review & Testing Checklist for Human

Yellow risk — this is additive test infrastructure that doesn't change runtime behavior. The skill itself is unchanged. Two items worth checking:

  • Spot-check at least one grading.json (e.g., iteration-1/1-happy-path-single-unit-dispatch/with_skill/grading.json): does the grader's per-expectation evidence hold up against the corresponding outputs/output.md? If the grader is too lenient, the 95-100% pass rate would be misleading.
  • Skim evals/ce-dispatch-workspace/REPORT.md and confirm the four evals cover the surfaces you actually care about for this MVP. If anything important is missing (e.g., the comment-protocol "STOP and wait" semantics, or a dispatch into a worktree that doesn't exist yet), I can add evals.

To re-run the pack yourself (requires the OPENROUTER_API_KEY env var):

```bash
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py
```
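
A few useful flag combinations (the flags are the ones documented above; the --eval-id value mirrors the eval names, but the exact accepted format is a guess, and the model slug is a placeholder):

```bash
# Check pack wiring without spending API credits
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py --dry-run

# Re-run only Eval 2 three times, skipping the baseline arm
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py \
  --eval-id phase-4-respond-review-pr --runs 3 --skip-without-skill

# Swap in a different executor model (any OpenRouter slug)
python3 plugins/compound-engineering/skills/ce-dispatch/evals/scripts/run_eval_pack.py \
  --executor-model <openrouter-model-slug>
```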

Notes

Link to Devin session: https://app.devin.ai/sessions/56b17768c2ef4657ba155e2435bf1548
Requested by: @shubness


test(ce-dispatch): add skill-creator-style eval pack + battle-test results

Battle-tests the single-unit sync MVP rewrite (PR #4) against Anthropic's
skill-creator eval protocol (https://github.com/anthropics/skills/tree/main/skills/skill-creator).

Adds:
- evals/evals.json: 4 prompts covering Phase 0-3 happy-path dispatch and
  Phase 4 respond-loop branches (review PR, reply to agent comment, mark
  unit complete). Each prompt has 5-9 quantitative expectations.
- evals/files/sample-multi-unit-plan.md: realistic multi-unit plan
  fixture used by Eval 1 (rate-limit feature with three implementation
  units; dispatch targets U2).
- evals/scripts/run_eval_pack.py: Path-A runner that mirrors
  skill-creator's protocol via direct OpenRouter calls (system+user with
  loaded skill vs. baseline; JSON-graded against expectations; aggregated
  into benchmark.{json,md}). Defaults to claude-opus-4.7 (executor and
  grader); accepts --executor-model for substitution.
- evals/ce-dispatch-workspace/: iteration-1 (full pack) and iteration-2
  (refined eval-2 assertion) run results. Analyst-grade artifacts
  (grading.json, benchmark.json, transcript.md, eval_metadata.json,
  timing.json, metrics.json, output.md) are committed; raw transcript
  dumps are gitignored as regenerable.
- evals/ce-dispatch-workspace/REPORT.md: human-readable battle-test
  report.

Headline results (Opus 4.7, ~$1.50 in API costs total):
  with_skill:    24/24 expectations pass after iteration-2 (95% → 100%)
  without_skill: ~12/24 baseline (51-56%)
  delta:         +44 to +49 percentage points across 4 prompts

No skill changes required — eval surfaced one over-prescriptive
assertion (caught by the grader's eval_feedback) but no skill bugs.

bun test + bun run release:validate remain green (1307/1308 pass; 1 fail
is the pre-existing resolve-base.sh detached-shallow env failure).
devin-ai-integration[bot] commented

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


devin-ai-integration[bot] left a comment


Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.


Comment on lines +397 to +400

```python
if grader_text.startswith("```"):
    grader_text = grader_text.split("\n", 1)[1]
if grader_text.endswith("```"):
    grader_text = grader_text.rsplit("```", 1)[0]
```

🟡 Unhandled IndexError in grader response fence-stripping when response has no newline after opening backticks

When the grader model returns a response that starts with ``` but contains no newline character (e.g., just ``` or ```json), grader_text.split("\n", 1) returns a single-element list, and [1] raises an IndexError. This crashes the script instead of falling through to the json.JSONDecodeError handler below. While unlikely in practice (LLM responses almost always include a newline after an opening fence), this is a gap in the defensive parsing logic that the code explicitly intends to handle.

Suggested change

```diff
 if grader_text.startswith("```"):
-    grader_text = grader_text.split("\n", 1)[1]
+    parts = grader_text.split("\n", 1)
+    if len(parts) > 1:
+        grader_text = parts[1]
 if grader_text.endswith("```"):
     grader_text = grader_text.rsplit("```", 1)[0]
```