
spec: Decouple eval framework as standalone testing tool #24

Merged
tikazyq merged 7 commits into main from claude/decouple-eval-framework-GIOVK
Mar 19, 2026

tikazyq commented Mar 18, 2026

Summary

This spec proposes decoupling the eval framework from synodic's governance harness, enabling it to function as an independent, zero-dependency testing framework. Currently, eval is tightly coupled to governance concerns (writing to .harness/eval.governance.jsonl, reading SYNODIC_ROOT), preventing external use and independent versioning.

Design Changes

Separation of Concerns:

  • Eval produces only structured JSON output (verdict + score reports) and exit codes
  • Synodic's harness becomes the consumer: it invokes eval, reads output, and writes governance logs
  • All governance-specific code removed from eval codebase

Architecture:

  • Restructure as a Cargo workspace with two member crates:
    • synodic-eval: Standalone eval framework (no synodic dependencies)
    • synodic: Governance harness that consumes eval output
  • New synodic/src/governance.rs handles reading eval JSON and writing governance JSONL
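
As a rough illustration of the harness-side consumer described above, the sketch below appends one JSONL record per eval result to `.harness/eval.governance.jsonl`. The `EvalResult` fields shown are a hypothetical subset, the record layout is an assumption rather than the real governance log schema, and the hand-rolled `json_escape` helper only stands in for whatever JSON library the harness actually uses.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical subset of an eval result; field names are assumptions.
struct EvalResult {
    instance_id: String,
    benchmark: String,
    resolved: bool,
    duration_s: u64,
}

/// Minimal JSON string escaping (illustrative stand-in for a JSON library).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one governance record as a single JSON line (assumed layout).
fn governance_line(r: &EvalResult) -> String {
    format!(
        "{{\"instance_id\":\"{}\",\"benchmark\":\"{}\",\"resolved\":{},\"duration_s\":{}}}",
        json_escape(&r.instance_id),
        json_escape(&r.benchmark),
        r.resolved,
        r.duration_s
    )
}

/// Append the record to `<project_root>/.harness/eval.governance.jsonl`,
/// creating the directory and file if they do not exist yet.
fn append_governance_log(project_root: &Path, r: &EvalResult) -> io::Result<()> {
    let dir = project_root.join(".harness");
    fs::create_dir_all(&dir)?;
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join("eval.governance.jsonl"))?;
    writeln!(file, "{}", governance_line(r))
}
```

Keeping this function in the harness crate means the eval crate never needs to know that `.harness/` exists; it only has to hand over a result value or a JSON file.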

Code Removals from Eval:

  • append_governance_log() function (eval/run.rs:486-534)
  • extract_findings() helper (governance categorization moved to harness)
  • All .harness/ directory creation and references
  • SYNODIC_ROOT environment variable reads
  • Harness-specific comments and cross-run learning references

Project Root Discovery:

  • Split find_repo_root(): eval gets find_project_root() (looks for evals/ or .git), harness keeps original (looks for .harness/)
  • Replace SYNODIC_ROOT with EVAL_ROOT env var for eval-specific configuration

Key Benefits

  • Eval usable as standalone testing framework without governance infrastructure
  • Independent versioning and release cycles
  • External teams can adopt eval without synodic governance
  • Cleaner separation enables easier maintenance and testing
  • Governance log schema remains unchanged (backward compatible)

Testing Strategy

  • All 29 existing eval tests pass in standalone crate
  • Standalone binary builds with zero synodic dependencies
  • Eval works in directories without .harness/
  • Synodic harness integration still functional
  • No harness/governance references leak into eval codebase

https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2

claude added 2 commits March 18, 2026 22:30
Adds spec 047 with architecture for extracting eval into a standalone
crate (synodic-eval) within a Cargo workspace, with EvalReporter trait
to replace hardcoded governance log coupling.

https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2

Drop EvalReporter trait design in favor of complete separation:
eval produces JSON output only, harness reads it and writes its
own governance logs. No governance concepts leak into eval at all.

https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2
Copilot AI review requested due to automatic review settings March 18, 2026 23:16

Copilot AI left a comment

Pull request overview

Adds a new project spec describing how to decouple the existing eval framework from Synodic’s governance harness so eval can run as a standalone tool that emits structured results (JSON + exit codes), with the harness consuming those results and producing governance logs.

Changes:

  • Introduces spec 047 outlining separation of concerns between eval execution and governance logging.
  • Proposes a Cargo workspace split into synodic (harness) and synodic-eval (standalone eval).
  • Defines (at a high level) an eval output contract and updated project-root discovery approach.

Comment on lines +77 to +95
### Eval output contract

Eval communicates results through two channels only:

**1. Exit code** — `0` = resolved, `1` = not resolved, `2` = error
**2. Structured JSON output** — written to `--output <path>` or stdout:

```json
{
  "instance_id": "django__django-16379",
  "benchmark": "swebench",
  "skill": "factory",
  "resolved": true,
  "duration_s": 142,
  "f2p": { "group": "FAIL_TO_PASS", "expected": 3, "passed": 3 },
  "p2p": { "group": "PASS_TO_PASS", "expected": 47, "passed": 47 },
  "score_report": "path/to/score_report.json"
}
```
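
The exit-code half of this contract can be sketched as a small enum. The type name `Outcome` and the helper `classify` are hypothetical; only the three numeric codes come from the text above.

```rust
/// The three outcomes the output contract distinguishes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Resolved,
    NotResolved,
    Error,
}

impl Outcome {
    /// Exit code as listed in the contract: 0, 1, or 2.
    fn exit_code(self) -> i32 {
        match self {
            Outcome::Resolved => 0,
            Outcome::NotResolved => 1,
            Outcome::Error => 2,
        }
    }

    /// Inverse mapping for a consumer; unknown codes yield `None`.
    fn from_exit_code(code: i32) -> Option<Outcome> {
        match code {
            0 => Some(Outcome::Resolved),
            1 => Some(Outcome::NotResolved),
            2 => Some(Outcome::Error),
            _ => None,
        }
    }
}

/// Classify a run: `Ok(resolved)` comes from scoring, `Err` is any pipeline failure.
fn classify<E>(run: &Result<bool, E>) -> Outcome {
    match run {
        Ok(true) => Outcome::Resolved,
        Ok(false) => Outcome::NotResolved,
        Err(_) => Outcome::Error,
    }
}
```

A binary could pass `classify(&result).exit_code()` to `std::process::exit`, while a consumer that spawns eval as a subprocess could use `from_exit_code` to tell "not resolved" apart from "error" before reading the JSON.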

### Project root discovery

Eval needs to find the project root (for `evals/evals.json`). Currently it piggybacks on `find_repo_root()` which looks for `.harness/`. After decoupling:

The eval framework (setup → agent → score pipeline) is a general-purpose AI coding evaluation tool that should work independently of synodic's governance harness. Today eval directly writes to `.harness/eval.governance.jsonl` and reads `SYNODIC_ROOT`, although it has no business knowing about governance at all.

**Why now:** Eval is mature enough (29 tests, 3 benchmarks, batch mode) to stand alone. Complete separation enables:
- Eval as a zero-dependency testing framework — no governance concepts leak in

tikazyq commented Mar 18, 2026

@copilot implement

Copilot AI commented Mar 18, 2026

@tikazyq I've opened a new pull request, #25, to work on those changes. Once the pull request is ready, I'll request review from you.

Split cli/ into a Cargo workspace with two crates:

- synodic-eval: standalone eval framework (setup → agent → score)
  - Zero governance dependencies — no .harness/, no SYNODIC_ROOT
  - run::execute() returns EvalResult instead of writing gov logs
  - Uses EVAL_ROOT env var and find_project_root() (evals/ or .git)
  - Produces standalone binary: synodic-eval run|score|list|batch|report
  - All 35 tests pass independently

- synodic: governance CLI depending on synodic-eval as library
  - New governance.rs: reads EvalResult, writes .harness/eval.governance.jsonl
  - extract_findings() and append_governance_log() moved here from eval
  - harness/run.rs sets EVAL_ROOT alongside SYNODIC_ROOT for agent subprocesses
  - synodic eval run still works (dispatches to synodic_eval, writes gov log)

Implements spec 047-decouple-eval-framework.

https://claude.ai/code/session_01NhfevEyKE5jXFdwWtSqVU2

tikazyq commented Mar 19, 2026

@copilot resolve the merge conflicts

Copilot AI commented Mar 19, 2026

@tikazyq I've opened a new pull request, #27, to work on those changes. Once the pull request is ready, I'll request review from you.

tikazyq merged commit bc2df9f into main on Mar 19, 2026
2 checks passed
tikazyq deleted the claude/decouple-eval-framework-GIOVK branch on March 19, 2026 00:29

4 participants