Collect eval findings and pass repo root to agent subprocesses (#21)
Merged
Conversation
Ran `synodic eval run swe:django-10097` wrapped in harness governance. The Layer 1 static gate passed. Scoring showed a Python version incompatibility with the old Django codebase (the codeset keyword was removed in newer Python). https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
Two bugs fixed:
1. `find_repo_root()` now respects the `SYNODIC_ROOT` env var, which the harness sets when spawning eval subprocesses. Previously, eval governance logs were written to the testbed's `.harness/` (the wrong git repo) because `find_repo_root()` resolved the CWD to the testbed.
2. The harness governance log now collects eval findings from `eval.governance.jsonl` and includes them in the harness record and manifest. Previously the harness only recorded pass/fail status with no eval-level learnings.

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
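The first fix can be sketched roughly as follows. This is a minimal sketch: only the `SYNODIC_ROOT` override is from this PR; the function signature and the `.git`-based fallback walk are assumptions about what `find_repo_root()` in `cli/src/util.rs` might look like.

```rust
use std::env;
use std::path::{Path, PathBuf};

// Sketch: honor the harness-provided SYNODIC_ROOT override first, so eval
// subprocesses write governance logs into the real project instead of the
// testbed; otherwise fall back to walking up looking for a `.git` directory.
fn find_repo_root(start: &Path) -> Option<PathBuf> {
    if let Ok(root) = env::var("SYNODIC_ROOT") {
        return Some(PathBuf::from(root));
    }
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

fn main() {
    // Simulate the harness having exported the canonical repo root.
    env::set_var("SYNODIC_ROOT", "/tmp/project");
    let root = find_repo_root(Path::new(".")).unwrap();
    println!("{}", root.display()); // prints /tmp/project with the override set
}
```

With the env var unset, the function behaves as before and resolves from the current directory, so existing callers are unaffected.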
Pull request overview
This PR enhances the harness run workflow by (1) propagating the canonical repository root to agent/eval subprocesses and (2) collecting eval findings from .harness/eval.governance.jsonl into the harness governance record and run manifest, improving traceability across runs.
Changes:
- Update repo root detection to respect `SYNODIC_ROOT` so subprocesses can consistently target the correct `.harness/` directory.
- Pass `SYNODIC_ROOT` (repo root) into agent subprocess environments from `harness run`.
- Add `collect_eval_findings()` and include its output in `harness.governance.jsonl` records and `manifest.json`.
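The second bullet, passing the repo root into agent subprocess environments, might look roughly like this. The `sh` command here is a stand-in for the agent binary; `run_agent()`'s real signature is not shown on this page.

```rust
use std::process::Command;

fn main() {
    // Stand-in for run_agent(): spawn a child with SYNODIC_ROOT in its
    // environment so find_repo_root() inside the child resolves the real
    // repo root rather than the testbed it runs in.
    let out = Command::new("sh")
        .args(["-c", "printf %s \"$SYNODIC_ROOT\""])
        .env("SYNODIC_ROOT", "/tmp/project")
        .output()
        .expect("failed to spawn child");
    // The child observes the harness-provided root:
    println!("{}", String::from_utf8_lossy(&out.stdout)); // /tmp/project
}
```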
Reviewed changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `cli/src/util.rs` | Prefer `SYNODIC_ROOT` when locating the repo root to avoid writing governance artifacts into testbeds. |
| `cli/src/harness/run.rs` | Propagate repo root to subprocesses and collect recent eval governance findings into harness outputs. |
`cli/src/harness/run.rs` (outdated, excerpt):

```rust
// Only collect entries written after this harness run started
if let Some(ts) = record.get("timestamp").and_then(|v| v.as_str()) {
    if let Ok(entry_time) = chrono::DateTime::parse_from_rfc3339(ts) {
        if entry_time < *run_start {
```
`cli/src/harness/run.rs` (outdated), comment on lines +495 to +499:

```rust
let entry = json!({
    "instance_id": record.get("instance_id"),
    "benchmark": record.get("benchmark"),
    "resolved": record.get("resolved"),
    "findings": record.get("findings").unwrap_or(&json!([])),
```
The previous approach duplicated eval findings into the harness governance log by reading back `eval.governance.jsonl` after the subprocess wrote it. This was fragile (timestamp matching) and semantically wrong (harness status was "passed" even when eval scored resolved=false). Now:
- eval exits non-zero when resolved=false (exit 1)
- the harness checks the agent exit code as a final gate: if the governance layers pass but the agent reported failure, status becomes "error"
- findings stay in `eval.governance.jsonl` only (single source of truth)
- `collect_eval_findings` and the `eval_findings` duplication are removed

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
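The final-gate logic described above can be sketched as a small status function. Names here are illustrative, not the PR's actual identifiers:

```rust
// Sketch of the harness's final gate, assuming `governance_passed` is the
// combined result of the governance layers and `agent_exit_code` is the
// agent subprocess's exit status.
fn final_status(governance_passed: bool, agent_exit_code: i32) -> &'static str {
    if !governance_passed {
        "failed"
    } else if agent_exit_code != 0 {
        // Eval exits 1 when resolved=false, so a clean governance pass
        // with a failing agent is surfaced as "error", never "passed".
        "error"
    } else {
        "passed"
    }
}

fn main() {
    println!("{}", final_status(true, 0));  // passed
    println!("{}", final_status(true, 1));  // error
    println!("{}", final_status(false, 0)); // failed
}
```

This keeps the harness record honest without re-reading the eval's log file.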
All entries were from iterative testing of the harness+eval integration. No production data. https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
Summary
This change enhances the harness run execution to collect evaluation findings from eval governance logs and ensures agent subprocesses have access to the correct repository root via an environment variable.
Key Changes
- Added a `collect_eval_findings()` function that reads the `eval.governance.jsonl` log file and extracts findings from entries written during the current harness run. These findings are included in both the governance record and the final output JSON.
- Updated `run_agent()` and `run_agent_with_stdin()` to accept the `repo_root` parameter and pass it as the `SYNODIC_ROOT` environment variable to agent subprocesses.
- Updated `find_repo_root()` in `util.rs` to respect the `SYNODIC_ROOT` environment variable, allowing eval subprocesses to write governance logs to the correct project rather than the testbed.

Implementation Details
- `collect_eval_findings()` reads up to the last 10 lines of the eval governance log in reverse order and filters entries by timestamp to include only those written after the harness run started.
- Each collected entry carries the `instance_id`, `benchmark`, `resolved`, and `findings` fields extracted from the governance log entries.
- Findings are included in both `gov_record` (the governance output) and the final JSON output for tracking and analysis.

https://claude.ai/code/session_0157iwKYYLnrPU4dyNAxHgQ3
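The tail-and-filter step described above (since removed in a later commit) can be sketched without `chrono` by relying on the fact that RFC 3339 timestamps in UTC ("Z") compare chronologically as plain strings. The entry shape and names here are illustrative; real code would parse each JSONL line:

```rust
// Each log line is modeled as (rfc3339_timestamp, payload). Assumes all
// timestamps are UTC ("Z"-suffixed), so lexicographic string comparison
// matches chronological order.
fn collect_recent<'a>(lines: &[(&'a str, &'a str)], run_start: &str) -> Vec<&'a str> {
    lines
        .iter()
        .rev()                              // newest entries sit at the tail
        .take(10)                           // inspect at most the last 10 lines
        .filter(|(ts, _)| *ts >= run_start) // keep entries written after run start
        .map(|(_, payload)| *payload)
        .collect()
}

fn main() {
    let log = [
        ("2024-01-01T00:00:00Z", "old finding"),
        ("2024-01-01T12:00:00Z", "new finding"),
    ];
    let recent = collect_recent(&log, "2024-01-01T06:00:00Z");
    println!("{:?}", recent); // ["new finding"]
}
```

The string-comparison shortcut is exactly the kind of fragility the later commit removed in favor of exit-code gating.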