
DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708) #711

Draft

juanmichelini wants to merge 1 commit into main from swtbench-strip-non-test-files

DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)#711
juanmichelini wants to merge 1 commit into
mainfrom
swtbench-strip-non-test-files

Conversation

@juanmichelini (Collaborator)

Addresses option 2 from #708.

What

Extends the existing setup-file strip in benchmarks/swtbench/eval_infer.py with a positive whitelist that keeps only test-file diffs in model_patch. SWT-bench scores nothing else, and non-test diffs in model_patch actively hurt the score: with the agent's own source "fix" applied, the new test runs against patched rather than buggy code, which silences the F2P (fail-to-pass) signal.

Changes

  • benchmarks/utils/patch_utils.py: new keep_only_test_files(git_patch) helper that mirrors the structure of the existing remove_files_from_patch. A file is kept iff its path lives under tests/, test/, testing/, or its basename matches test_*.py / *_test.py / conftest.py. Repo-root paths are intentionally rejected so agent-authored scratch files (reproduction.py, test_repro.py, FIX_SUMMARY.md, …) are dropped even when they look testy.
  • benchmarks/swtbench/eval_infer.py: call keep_only_test_files immediately after the existing remove_files_from_patch(setup_files) call. Same place, same shape as the tox.ini/pyproject.toml/setup.py strip.
  • tests/test_patch_utils_keep_test_files.py: unit tests covering tests-dir, testing-dir, conftest, suffix/prefix names, repo-root scratch rejection, build/docs rejection, and the all-non-test case.
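To make the filtering rules above concrete, here is a minimal sketch of what the helper could look like. This is a hypothetical reconstruction from the description, not the actual code in benchmarks/utils/patch_utils.py, which may differ in structure and edge cases:

```python
import re

# Hypothetical sketch of keep_only_test_files based on the rules described
# above; the real helper in benchmarks/utils/patch_utils.py may differ.
TEST_DIRS = {"tests", "test", "testing"}
TEST_BASENAME = re.compile(r"^(test_.*\.py|.*_test\.py|conftest\.py)$")


def _is_test_path(path: str) -> bool:
    parts = path.split("/")
    # Keep anything under a recognized test directory, at any depth.
    if any(p in TEST_DIRS for p in parts[:-1]):
        return True
    # Keep test-named files only when they are NOT at the repo root,
    # so agent scratch files like ./test_repro.py are still dropped.
    return len(parts) > 1 and TEST_BASENAME.match(parts[-1]) is not None


def keep_only_test_files(git_patch: str) -> str:
    """Keep only the per-file diff blocks whose target path is a test file."""
    kept = []
    # Each file's diff begins with a "diff --git a/<path> b/<path>" header.
    for block in re.split(r"(?=^diff --git )", git_patch, flags=re.MULTILINE):
        m = re.match(r"diff --git a/(\S+) b/(\S+)", block)
        if m and _is_test_path(m.group(2)):
            kept.append(block)
    return "".join(kept)
```

Splitting on the `diff --git` header keeps each file's hunks intact, so the surviving blocks concatenate back into a valid patch.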

Why

From run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463 (424 instances), cross-tabbing patch shape against the resolved set:

| Patch category | # instances | Resolved | Rate |
| --- | --- | --- | --- |
| Only test files | 22 | 10 | 45.5% |
| Mixed test + source code | 332 | 4 | 1.2% |
| Only source code / no test | 64 | 0 | 0% |
| Empty | 3 | 0 | 0% |
| Total | 424 | 14 | 3.3% |

Stripping non-test diffs converts most mixed patches into the "only test files" shape — the population that already resolves at 45.5 %. Not every mixed patch will recover (some tests are calibrated to the agent's own fix or import symbols the agent added), but even a conservative recovery is multiples of the current score.
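For intuition, a back-of-the-envelope projection using the table's numbers. The recovery fractions here are illustrative assumptions, not measurements:

```python
# Illustrative projection using the run's numbers above; the recovery
# fractions are assumptions, not measurements.
mixed = 332                 # mixed test + source patches
only_test_rate = 10 / 22    # resolve rate of "only test files" patches (~45.5%)
current_resolved = 14
total_instances = 424


def projected_total(recovery: float) -> float:
    """Resolved count if stripped mixed patches hit `recovery` of the only-test rate."""
    return current_resolved + mixed * only_test_rate * recovery


for r in (0.25, 0.5, 1.0):
    t = projected_total(r)
    print(f"{r:.0%} recovery -> ~{t:.0f}/{total_instances} resolved ({t / total_instances:.1%})")
```

Even the pessimistic 25%-recovery case lands well above the current 3.3% overall rate.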

How to test

No re-inference required. Run benchmarks/swtbench/eval_infer.py against the existing output.jsonl from the bad run; the regenerated output.swtbench.jsonl will have only test-file diffs and the harness will rescore accordingly.

Unit tests:

pytest tests/test_patch_utils_keep_test_files.py

Risks / caveats

  • Heuristic, not ground truth. The whitelist is path-based. A more precise version would intersect with the gold test_patch file set per instance; deferred to a follow-up to keep this PR minimal.
  • Doesn't fix the root cause. The agent still wastes tokens on source-code fixes. Option 1 (a prompt change, opened separately as #710) is complementary, not an alternative.
  • Rescoring changes existing numbers. Any past SWT-bench run rescored with this code will see different numbers, so results reported on the benchmark monitor will need a re-eval pass to stay consistent. Existing output.swtbench.jsonl files in completed runs are untouched on disk; only future invocations of eval_infer.py are affected.
  • The new test file lives at tests/test_patch_utils_keep_test_files.py; happy to move/rename if the repo has a different convention I missed.
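The deferred "more precise" variant mentioned in the first caveat could intersect the model patch with the gold test_patch roughly like this. A hypothetical sketch only: the function names and the availability of a per-instance gold test_patch string are assumptions:

```python
import re

# Hypothetical follow-up: restrict model_patch to files the gold test_patch
# touches, instead of relying on path heuristics.


def gold_test_paths(gold_test_patch: str) -> set:
    """Paths touched by the benchmark's gold test_patch for one instance."""
    return set(re.findall(r"^diff --git a/\S+ b/(\S+)", gold_test_patch, flags=re.MULTILINE))


def keep_only_gold_test_files(model_patch: str, gold_test_patch: str) -> str:
    """Keep only per-file diff blocks whose path also appears in the gold test_patch."""
    allowed = gold_test_paths(gold_test_patch)
    kept = []
    for block in re.split(r"(?=^diff --git )", model_patch, flags=re.MULTILINE):
        m = re.match(r"diff --git a/\S+ b/(\S+)", block)
        if m and m.group(1) in allowed:
            kept.append(block)
    return "".join(kept)
```

Note this is stricter than the path whitelist: it would also drop new test files the agent created that the gold patch does not touch, which may or may not be desirable.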

Refs #708.


This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.


SWT-bench only scores diffs against existing test files. Today's
post-processing in eval_infer.py removes a small blacklist of setup
files (pyproject.toml, tox.ini, setup.py) but leaves everything else
alone. That means any source-code 'fix' the agent committed alongside
its test, plus scratch artifacts like reproduction.py, FIX_SUMMARY.md,
build/lib/*, and docs/ changes, all flow through into model_patch.

Source-code diffs are particularly harmful: when SWT-bench applies
model_patch and runs the new test, the test is now executing against
the agent's own 'fixed' code rather than the buggy code, so the F2P
signal is silenced and the instance is marked unresolved.

Empirically (run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463,
424 instances):

  Only test files in patch          : 22  instances, 10 resolved (45.5%)
  Mixed test + source code in patch : 332 instances,  4 resolved ( 1.2%)
  Only source code in patch         : 64  instances,  0 resolved ( 0.0%)
  Empty patch                       :  3  instances,  0 resolved ( 0.0%)
  Total                             : 424 instances, 14 resolved ( 3.3%)

This adds keep_only_test_files() in benchmarks/utils/patch_utils.py
that mirrors remove_files_from_patch() and applies the inverse
predicate: keep diffs whose target file lives under tests/, test/,
testing/, or whose basename matches test_*.py / *_test.py /
conftest.py. Repo-root paths are intentionally rejected so the agent's
own root-level scratch test_repro.py files are dropped too.

eval_infer.py applies this immediately after the existing setup-file
strip, so all existing SWT-bench runs can be rescored from
output.jsonl without re-running inference. See #708.

Co-authored-by: openhands <openhands@all-hands.dev>

Labels

build-swt-bench (Build SWT-Bench images based on SDK version on this PR), investigation