spec: Decouple eval framework as standalone testing tool #24
Merged
Adds spec 047 with architecture for extracting eval into a standalone crate (synodic-eval) within a Cargo workspace, with EvalReporter trait to replace hardcoded governance log coupling. https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2
Drop EvalReporter trait design in favor of complete separation — eval produces JSON output only, harness reads it and writes its own governance logs. No governance concepts leak into eval at all. https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2
Pull request overview
Adds a new project spec describing how to decouple the existing eval framework from Synodic’s governance harness so eval can run as a standalone tool that emits structured results (JSON + exit codes), with the harness consuming those results and producing governance logs.
Changes:
- Introduces spec 047 outlining separation of concerns between eval execution and governance logging.
- Proposes a Cargo workspace split into `synodic` (harness) and `synodic-eval` (standalone eval).
- Defines (at a high level) an eval output contract and updated project-root discovery approach.
Comment on lines +77 to +95
### Eval output contract

Eval communicates results through two channels only:

**1. Exit code**: `0` = resolved, `1` = not resolved, `2` = error
**2. Structured JSON output**: written to `--output <path>` or stdout:

```json
{
  "instance_id": "django__django-16379",
  "benchmark": "swebench",
  "skill": "factory",
  "resolved": true,
  "duration_s": 142,
  "f2p": { "group": "FAIL_TO_PASS", "expected": 3, "passed": 3 },
  "p2p": { "group": "PASS_TO_PASS", "expected": 47, "passed": 47 },
  "score_report": "path/to/score_report.json"
}
```
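To make the two channels concrete, here is a minimal sketch of how the contract might be modelled in Rust. The field names follow the JSON example above; the `GroupResult` type, the `exit_code` helper, `sample_result`, and the use of `Result<EvalResult, String>` for the error case are assumptions made for illustration, not the actual `synodic-eval` types.

```rust
use std::process::ExitCode;

/// Per-group test tally, mirroring the `f2p` / `p2p` objects in the JSON example.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct GroupResult {
    group: String,
    expected: u32,
    passed: u32,
}

/// Hypothetical in-memory form of the eval output contract.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct EvalResult {
    instance_id: String,
    benchmark: String,
    skill: String,
    resolved: bool,
    duration_s: u64,
    f2p: GroupResult,
    p2p: GroupResult,
    score_report: String,
}

/// Map an outcome onto the documented exit codes:
/// 0 = resolved, 1 = not resolved, 2 = error.
fn exit_code(outcome: &Result<EvalResult, String>) -> u8 {
    match outcome {
        Ok(result) if result.resolved => 0,
        Ok(_) => 1,
        Err(_) => 2,
    }
}

/// Build the result shown in the JSON example (illustrative values only).
fn sample_result() -> EvalResult {
    EvalResult {
        instance_id: "django__django-16379".to_string(),
        benchmark: "swebench".to_string(),
        skill: "factory".to_string(),
        resolved: true,
        duration_s: 142,
        f2p: GroupResult { group: "FAIL_TO_PASS".to_string(), expected: 3, passed: 3 },
        p2p: GroupResult { group: "PASS_TO_PASS".to_string(), expected: 47, passed: 47 },
        score_report: "path/to/score_report.json".to_string(),
    }
}

fn main() -> ExitCode {
    let outcome: Result<EvalResult, String> = Ok(sample_result());
    ExitCode::from(exit_code(&outcome))
}
```

Keeping the exit-code mapping in one small function means a caller such as the harness only needs the process status to branch on resolved, not resolved, or error, and can read the JSON for details.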

### Project root discovery

Eval needs to find the project root (for `evals/evals.json`). Currently it piggybacks on `find_repo_root()` which looks for `.harness/`. After decoupling:
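The list that followed in the spec is cut off in this excerpt. Going by the implementation commit later in this thread (an `EVAL_ROOT` env var plus a `find_project_root()` that looks for `evals/` or `.git`), the eval-side lookup might be sketched as below. The precedence of `EVAL_ROOT` over the directory walk, the `find_project_root_from` helper, and the `Option` return type are assumptions, not taken from the actual code.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Walk upward from `start` (inclusive) until a directory containing
/// `evals/` or `.git` is found.
fn find_project_root_from(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join("evals").is_dir() || dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Resolve the project root. Assumption: an explicit `EVAL_ROOT` wins,
/// otherwise search upward from the current directory.
fn find_project_root() -> Option<PathBuf> {
    if let Some(root) = env::var_os("EVAL_ROOT") {
        return Some(PathBuf::from(root));
    }
    let cwd = env::current_dir().ok()?;
    find_project_root_from(&cwd)
}

fn main() {
    match find_project_root() {
        Some(root) => println!("project root: {}", root.display()),
        None => eprintln!("no evals/ or .git found; set EVAL_ROOT"),
    }
}
```

Note that nothing here mentions `.harness/` or `SYNODIC_ROOT`, which is the point of the split: the harness can keep its own `find_repo_root()` unchanged.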
The eval framework (setup → agent → score pipeline) is a general-purpose AI coding evaluation tool that should work independently of synodic's governance harness. Today eval directly writes to `.harness/eval.governance.jsonl` and reads `SYNODIC_ROOT`; it has no business knowing about governance at all.

**Why now:** Eval is mature enough (29 tests, 3 benchmarks, batch mode) to stand alone. Complete separation enables:
- Eval as a zero-dependency testing framework: no governance concepts leak in
@copilot implement
Split cli/ into a Cargo workspace with two crates:

- synodic-eval: standalone eval framework (setup → agent → score)
  - Zero governance dependencies: no .harness/, no SYNODIC_ROOT
  - run::execute() returns EvalResult instead of writing gov logs
  - Uses EVAL_ROOT env var and find_project_root() (evals/ or .git)
  - Produces standalone binary: synodic-eval run|score|list|batch|report
  - All 35 tests pass independently
- synodic: governance CLI depending on synodic-eval as library
  - New governance.rs: reads EvalResult, writes .harness/eval.governance.jsonl
  - extract_findings() and append_governance_log() moved here from eval
  - harness/run.rs sets EVAL_ROOT alongside SYNODIC_ROOT for agent subprocesses
  - synodic eval run still works (dispatches to synodic_eval, writes gov log)

Implements spec 047-decouple-eval-framework.

https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2
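As a rough illustration of the harness side described in this commit, the sketch below takes an eval result and appends one line to `.harness/eval.governance.jsonl`. Only the file path and the names `EvalResult` and `append_governance_log` come from the thread; the reduced `EvalResult` fields, the function signature, the `json_string` helper, and the record schema (including the `source` field) are hypothetical, since the real governance log format is not shown here.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Reduced stand-in for the `EvalResult` returned by eval (assumed fields).
struct EvalResult {
    instance_id: String,
    resolved: bool,
    duration_s: u64,
}

/// Encode a string as a JSON string literal (quotes, backslashes, control chars).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Render one governance record as a single JSON line (hypothetical schema).
fn governance_line(result: &EvalResult) -> String {
    format!(
        "{{\"source\":\"eval\",\"instance_id\":{},\"resolved\":{},\"duration_s\":{}}}",
        json_string(&result.instance_id),
        result.resolved,
        result.duration_s
    )
}

/// Append the record to `<root>/.harness/eval.governance.jsonl`,
/// creating the directory and file if needed.
fn append_governance_log(root: &Path, result: &EvalResult) -> io::Result<()> {
    let dir = root.join(".harness");
    fs::create_dir_all(&dir)?;
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join("eval.governance.jsonl"))?;
    writeln!(file, "{}", governance_line(result))
}

fn main() -> io::Result<()> {
    let result = EvalResult {
        instance_id: "django__django-16379".to_string(),
        resolved: true,
        duration_s: 142,
    };
    // Write under a scratch directory so the demo does not touch a real repo.
    append_governance_log(&std::env::temp_dir().join("synodic-demo"), &result)
}
```

All knowledge of `.harness/` lives in this harness-side function; the eval crate only has to hand back its result.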
@copilot resolve the merge conflicts
Co-authored-by: tikazyq <3393101+tikazyq@users.noreply.github.com>
Summary
This spec proposes decoupling the eval framework from synodic's governance harness, enabling it to function as an independent, zero-dependency testing framework. Currently, eval is tightly coupled to governance concerns (writing to `.harness/eval.governance.jsonl`, reading `SYNODIC_ROOT`), preventing external use and independent versioning.

Design Changes
Separation of Concerns:
Architecture:
- `synodic-eval`: Standalone eval framework (no synodic dependencies)
- `synodic`: Governance harness that consumes eval output
- `synodic/src/governance.rs` handles reading eval JSON and writing governance JSONL

Code Removals from Eval:
- `append_governance_log()` function (eval/run.rs:486-534)
- `extract_findings()` helper (governance categorization moved to harness)
- `.harness/` directory creation and references
- `SYNODIC_ROOT` environment variable reads

Project Root Discovery:
- `find_repo_root()`: eval gets `find_project_root()` (looks for `evals/` or `.git`), harness keeps original (looks for `.harness/`)
- … `SYNODIC_ROOT` with `EVAL_ROOT` env var for eval-specific configuration

Key Benefits
Testing Strategy
- … `.harness/`

https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2