Collect eval findings and pass repo root to agent subprocesses#21

Merged
tikazyq merged 4 commits into main from claude/run-swebench-harness-44AVk on Mar 18, 2026

Conversation


@tikazyq tikazyq commented Mar 18, 2026

Summary

This change enhances harness run execution by collecting evaluation findings from the eval governance log and by passing the correct repository root to agent subprocesses via an environment variable.

Key Changes

  • Eval findings collection: Added collect_eval_findings() function that reads the eval.governance.jsonl log file and extracts findings from entries written during the current harness run. These findings are now included in both the governance record and final output JSON.
  • Repository root propagation: Modified run_agent() and run_agent_with_stdin() to accept and pass the repo_root parameter as the SYNODIC_ROOT environment variable to agent subprocesses.
  • Environment variable support in repo root detection: Updated find_repo_root() in util.rs to respect the SYNODIC_ROOT environment variable, allowing eval subprocesses to write governance logs to the correct project rather than the testbed.
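The SYNODIC_ROOT precedence described above can be sketched as follows. This is a std-only illustration, not the exact `util.rs` code; in particular, the `.git` walk-up fallback is an assumption about how detection behaves when the variable is unset:

```rust
use std::env;
use std::path::{Path, PathBuf};

// Sketch: an explicit SYNODIC_ROOT wins over CWD-based detection, so eval
// subprocesses spawned inside a testbed still resolve the harness project.
fn find_repo_root(start: &Path) -> Option<PathBuf> {
    if let Ok(root) = env::var("SYNODIC_ROOT") {
        return Some(PathBuf::from(root));
    }
    // Assumed fallback: walk up from `start` looking for a .git directory.
    let mut dir = start.to_path_buf();
    loop {
        if dir.join(".git").is_dir() {
            return Some(dir);
        }
        if !dir.pop() {
            return None;
        }
    }
}

fn main() {
    // set_var is `unsafe` as of the Rust 2024 edition; older editions
    // merely warn about the unnecessary `unsafe` block.
    unsafe { env::set_var("SYNODIC_ROOT", "/work/synodic") };
    let root = find_repo_root(Path::new("/tmp/testbed")).expect("root");
    assert_eq!(root, PathBuf::from("/work/synodic"));
    println!("resolved root: {}", root.display());
}
```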

Implementation Details

  • The collect_eval_findings() function reads up to the last 10 lines of the eval governance log in reverse order and filters entries by timestamp to only include those written after the harness run started.
  • Each collected finding includes instance_id, benchmark, resolved, and findings fields extracted from the governance log entries.
  • The eval findings are logged and included in both the gov_record (governance output) and final JSON output for tracking and analysis.
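A dependency-free sketch of the tail-reading filter described above. The real function parses JSON with serde_json and chrono; the naive string-field lookup and lexicographic RFC 3339 comparison here are simplifications (and note that the PR's final commits removed this function in favor of an exit-code gate):

```rust
// Simplified stand-in for a serde_json string-field lookup: "key":"value".
fn extract_field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\":\"", key);
    let start = line.find(&needle)? + needle.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

// Take the last 10 lines of the eval governance log, in reverse order, and
// keep only entries timestamped at or after the harness run start.
// RFC 3339 timestamps with the same UTC offset compare correctly as plain
// strings, which keeps this sketch free of the chrono dependency.
fn collect_eval_findings(log: &str, run_start: &str) -> Vec<String> {
    log.lines()
        .rev()
        .take(10)
        .filter(|line| matches!(extract_field(line, "timestamp"), Some(ts) if ts >= run_start))
        .map(str::to_owned)
        .collect()
}

fn main() {
    let log = concat!(
        "{\"timestamp\":\"2026-03-18T15:00:00Z\",\"resolved\":false}\n",
        "{\"timestamp\":\"2026-03-18T15:40:00Z\",\"resolved\":true}",
    );
    let findings = collect_eval_findings(log, "2026-03-18T15:36:00Z");
    assert_eq!(findings.len(), 1); // only the post-run-start entry survives
    println!("kept {} entry", findings.len());
}
```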

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3

claude added 2 commits March 18, 2026 15:36
Ran synodic eval run swe:django-10097 wrapped in harness governance.
Layer 1 static gate passed. Scoring showed Python version incompatibility
with old Django codebase (codeset keyword removed in newer Python).

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
Two bugs fixed:

1. find_repo_root() now respects SYNODIC_ROOT env var, which the harness
   sets when spawning eval subprocesses. Previously, eval governance logs
   were written to the testbed's .harness/ (wrong git repo) because
   find_repo_root() resolved CWD to the testbed.

2. Harness governance log now collects eval findings from
   eval.governance.jsonl and includes them in the harness record and
   manifest. Previously the harness only recorded pass/fail status with
   no eval-level learnings.

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
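The first fix depends on the harness exporting SYNODIC_ROOT when it spawns eval subprocesses. A sketch with `std::process::Command`, where the `sh -c echo` child is a stand-in for the real agent binary:

```rust
use std::path::Path;
use std::process::Command;

// Sketch of run_agent(): hand the canonical repo root to the child via the
// SYNODIC_ROOT environment variable so the child's find_repo_root()
// resolves the harness project instead of the testbed CWD.
fn run_agent(repo_root: &Path) -> std::io::Result<String> {
    let out = Command::new("sh")
        .arg("-c")
        .arg("echo \"$SYNODIC_ROOT\"") // stand-in for the agent command
        .env("SYNODIC_ROOT", repo_root) // propagate the repo root
        .output()?;
    Ok(String::from_utf8_lossy(&out.stdout).trim().to_string())
}

fn main() -> std::io::Result<()> {
    let seen = run_agent(Path::new("/work/synodic"))?;
    assert_eq!(seen, "/work/synodic");
    println!("child saw SYNODIC_ROOT={}", seen);
    Ok(())
}
```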
Copilot AI review requested due to automatic review settings March 18, 2026 15:58

Copilot AI left a comment

Pull request overview

This PR enhances the harness run workflow by (1) propagating the canonical repository root to agent/eval subprocesses and (2) collecting eval findings from .harness/eval.governance.jsonl into the harness governance record and run manifest, improving traceability across runs.

Changes:

  • Update repo root detection to respect SYNODIC_ROOT so subprocesses can consistently target the correct .harness/ directory.
  • Pass SYNODIC_ROOT (repo root) into agent subprocess environments from harness run.
  • Add collect_eval_findings() and include its output in harness.governance.jsonl records and manifest.json.

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 2 comments.

Reviewed files:

  • cli/src/util.rs — Prefer SYNODIC_ROOT when locating the repo root, to avoid writing governance artifacts into testbeds.
  • cli/src/harness/run.rs — Propagate the repo root to subprocesses and collect recent eval governance findings into harness outputs.


The review comments were attached to this excerpt from cli/src/harness/run.rs (diff lines +495 to +499); the `continue;` and closing braces are reconstructed here to make the fragment readable:

```rust
// Only collect entries written after this harness run started
if let Some(ts) = record.get("timestamp").and_then(|v| v.as_str()) {
    if let Ok(entry_time) = chrono::DateTime::parse_from_rfc3339(ts) {
        if entry_time < *run_start {
            continue; // reconstructed: skip entries from earlier runs
        }
    }
}
let entry = json!({
    "instance_id": record.get("instance_id"),
    "benchmark": record.get("benchmark"),
    "resolved": record.get("resolved"),
    "findings": record.get("findings").unwrap_or(&json!([])),
});
```
claude added 2 commits March 18, 2026 16:10
Previous approach duplicated eval findings into the harness governance
log by reading back eval.governance.jsonl after the subprocess wrote it.
This was fragile (timestamp matching) and semantically wrong (harness
status was "passed" even when eval scored resolved=false).

Now:
- eval exits non-zero when resolved=false (exit 1)
- harness checks agent exit code as a final gate: if governance layers
  pass but agent reported failure, status becomes "error"
- findings stay in eval.governance.jsonl only (single source of truth)
- removed collect_eval_findings and eval_findings duplication

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
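The exit-code gate described in this commit can be sketched as below. The "error" and "passed" statuses come from the commit message; the "failed" status for a governance rejection is an assumption, not taken from the source:

```rust
// Final gate: even when all governance layers pass, a non-zero agent exit
// code (eval exits 1 on resolved=false) downgrades the status to "error".
fn final_status(governance_passed: bool, agent_exit_code: i32) -> &'static str {
    if !governance_passed {
        "failed" // assumed name: a governance layer rejected the run
    } else if agent_exit_code != 0 {
        "error" // layers passed, but the agent itself reported failure
    } else {
        "passed"
    }
}

fn main() {
    assert_eq!(final_status(true, 0), "passed");
    assert_eq!(final_status(true, 1), "error"); // eval scored resolved=false
    assert_eq!(final_status(false, 0), "failed");
    println!("exit-code gate behaves as described");
}
```

This keeps eval.governance.jsonl as the single source of truth for findings while still surfacing eval failure in the harness status.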
All entries were from iterative testing of the harness+eval integration.
No production data.

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
@tikazyq tikazyq merged commit 47365ce into main Mar 18, 2026
2 checks passed
@tikazyq tikazyq deleted the claude/run-swebench-harness-44AVk branch March 18, 2026 16:49