evals: only judge based on spec

charleslien · charleslien · commit 57060c6daa00 · 2025-09-04T14:23:34.000-07:00
diff --git a/evals/git-evals/judge-git-eval.ts b/evals/git-evals/judge-git-eval.ts
@@ -49,7 +49,7 @@ function buildAnalysisPrompt(
       )
       .join('\n\n')
 
-  return `You are an expert software engineer tasked with analyzing and scoring the code quality of changes made by an AI coding assistant (Codebuff). Please analyze the following interaction trace and compare both the attempted changes and the ground truth changes.
+  return `You are an expert software engineer tasked with analyzing and scoring the code quality of changes made by an AI coding assistant (Codebuff). Please analyze and compare both the attempted changes and the ground truth changes.
 
 [SPEC]
 ${evalRun.eval_commit.spec}
@@ -75,6 +75,8 @@ Please analyze the implementation attempt and provide:
    - Code Quality: How well-structured, maintainable and idiomatic is the code?
    - Overall: Combined assessment of the implementation quality
 
+Note: The agent only has access to the spec, so do not dock points for anything not included in the spec (e.g. unit tests, documentation, etc.). If something is included in the spec but not in the changes, you should give a lower score.
+
 Focus on:
 - Correctness and completeness compared to the ground truth changes
 - Quality of the code produced
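
A minimal sketch of how the added note could fit into the judge prompt, not the actual contents of `judge-git-eval.ts`. Only the `[SPEC]` label and the quoted prompt sentences come from the diff above. The `JudgeInput` shape, the `buildJudgePromptSketch` name, and the `[GROUND_TRUTH_CHANGES]` and `[ATTEMPTED_CHANGES]` labels are hypothetical and used for illustration only.

```typescript
// Hypothetical, simplified stand-in for the prompt builder touched by this commit.
// Field names and section labels (other than [SPEC]) are assumptions.
interface JudgeInput {
  spec: string
  groundTruthDiff: string
  attemptDiff: string
}

// Sentence added by this commit, copied verbatim from the diff.
const SPEC_ONLY_NOTE =
  'Note: The agent only has access to the spec, so do not dock points for anything not included in the spec (e.g. unit tests, documentation, etc.). If something is included in the spec but not in the changes, you should give a lower score.'

function buildJudgePromptSketch(input: JudgeInput): string {
  // Join the sections with blank lines so each block stays visually separate.
  return [
    'Please analyze and compare both the attempted changes and the ground truth changes.',
    '[SPEC]',
    input.spec,
    '[GROUND_TRUTH_CHANGES]', // assumed label
    input.groundTruthDiff,
    '[ATTEMPTED_CHANGES]', // assumed label
    input.attemptDiff,
    SPEC_ONLY_NOTE,
  ].join('\n\n')
}
```

Placing the note after the inputs mirrors the diff, where it follows the scoring criteria, so the judge reads the scoring constraint once it has seen the spec and both sets of changes.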