
DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708) #711

Draft

juanmichelini wants to merge 1 commit into main from swtbench-strip-non-test-files

DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)#711
juanmichelini wants to merge 1 commit into
mainfrom
swtbench-strip-non-test-files

Conversation

@juanmichelini (Collaborator)

Addresses option 2 from #708.

What

Extends the existing setup-file strip in benchmarks/swtbench/eval_infer.py with a positive whitelist that keeps only test-file diffs in model_patch. SWT-bench scores nothing else, and non-test diffs in model_patch actively hurt the score: with the agent's own source "fix" applied, the new test runs against patched rather than buggy code, which silences the F2P (fail-to-pass) signal.

Changes

  • benchmarks/utils/patch_utils.py: new keep_only_test_files(git_patch) helper that mirrors the structure of the existing remove_files_from_patch. A file is kept iff its path lives under tests/, test/, testing/, or its basename matches test_*.py / *_test.py / conftest.py. Repo-root paths are intentionally rejected so agent-authored scratch files (reproduction.py, test_repro.py, FIX_SUMMARY.md, …) are dropped even when they look testy.
  • benchmarks/swtbench/eval_infer.py: call keep_only_test_files immediately after the existing remove_files_from_patch(setup_files) call. Same place, same shape as the tox.ini/pyproject.toml/setup.py strip.
  • tests/test_patch_utils_keep_test_files.py: unit tests covering tests-dir, testing-dir, conftest, suffix/prefix names, repo-root scratch rejection, build/docs rejection, and the all-non-test case.
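To make the filtering rules above concrete, here is a minimal sketch of what the helper could look like. This is a hypothetical reconstruction from the description, not the actual code in benchmarks/utils/patch_utils.py, which may differ in structure and edge cases:

```python
import re

# Hypothetical sketch of keep_only_test_files based on the rules described
# above; the real helper in benchmarks/utils/patch_utils.py may differ.
TEST_DIRS = {"tests", "test", "testing"}
TEST_BASENAME = re.compile(r"^(test_.*\.py|.*_test\.py|conftest\.py)$")


def _is_test_path(path: str) -> bool:
    parts = path.split("/")
    # Keep anything under a recognized test directory, at any depth.
    if any(p in TEST_DIRS for p in parts[:-1]):
        return True
    # Keep test-named files only when they are NOT at the repo root,
    # so agent scratch files like ./test_repro.py are still dropped.
    return len(parts) > 1 and TEST_BASENAME.match(parts[-1]) is not None


def keep_only_test_files(git_patch: str) -> str:
    """Keep only the per-file diff blocks whose target path is a test file."""
    kept = []
    # Each file's diff begins with a "diff --git a/<path> b/<path>" header.
    for block in re.split(r"(?=^diff --git )", git_patch, flags=re.MULTILINE):
        m = re.match(r"diff --git a/(\S+) b/(\S+)", block)
        if m and _is_test_path(m.group(2)):
            kept.append(block)
    return "".join(kept)
```

Splitting on the `diff --git` header keeps each file's hunks intact, so the surviving blocks concatenate back into a valid patch.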

Why

From run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463 (424 instances), cross-tabbing patch shape against the resolved set:

| Patch category | # instances | Resolved | Rate |
| --- | --- | --- | --- |
| Only test files | 22 | 10 | 45.5% |
| Mixed test + source code | 332 | 4 | 1.2% |
| Only source code / no test | 64 | 0 | 0% |
| Empty | 3 | 0 | 0% |
| Total | 424 | 14 | 3.3% |

Stripping non-test diffs converts most mixed patches into the "only test files" shape — the population that already resolves at 45.5 %. Not every mixed patch will recover (some tests are calibrated to the agent's own fix or import symbols the agent added), but even a conservative recovery is multiples of the current score.
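For intuition, a back-of-the-envelope projection using the table's numbers. The recovery fractions here are illustrative assumptions, not measurements:

```python
# Illustrative projection using the run's numbers above; the recovery
# fractions are assumptions, not measurements.
mixed = 332                 # mixed test + source patches
only_test_rate = 10 / 22    # resolve rate of "only test files" patches (~45.5%)
current_resolved = 14
total_instances = 424


def projected_total(recovery: float) -> float:
    """Resolved count if stripped mixed patches hit `recovery` of the only-test rate."""
    return current_resolved + mixed * only_test_rate * recovery


for r in (0.25, 0.5, 1.0):
    t = projected_total(r)
    print(f"{r:.0%} recovery -> ~{t:.0f}/{total_instances} resolved ({t / total_instances:.1%})")
```

Even the pessimistic 25%-recovery case lands well above the current 3.3% overall rate.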

How to test

No re-inference required. Run benchmarks/swtbench/eval_infer.py against the existing output.jsonl from the bad run; the regenerated output.swtbench.jsonl will have only test-file diffs and the harness will rescore accordingly.

Unit tests:

pytest tests/test_patch_utils_keep_test_files.py

Risks / caveats

  • Heuristic, not ground truth. The whitelist is path-based. A more precise version would intersect with the gold test_patch file set per instance; deferred to a follow-up to keep this PR minimal.
  • Doesn't fix the root cause. The agent still wastes tokens on source-code fixes. Option 1 (a prompt change, opened separately as #710) is complementary, not an alternative.
  • Rescoring changes existing numbers. Any past SWT-bench run rescored with this code will see different numbers, so results reported on the benchmark monitor will need a re-eval pass to stay consistent. Existing output.swtbench.jsonl files in completed runs are untouched on disk; only future invocations of eval_infer.py are affected.
  • The new test file lives at tests/test_patch_utils_keep_test_files.py; happy to move/rename if the repo has a different convention I missed.
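The deferred "more precise" variant mentioned in the first caveat could intersect the model patch with the gold test_patch roughly like this. A hypothetical sketch only: the function names and the availability of a per-instance gold test_patch string are assumptions:

```python
import re

# Hypothetical follow-up: restrict model_patch to files the gold test_patch
# touches, instead of relying on path heuristics.


def gold_test_paths(gold_test_patch: str) -> set:
    """Paths touched by the benchmark's gold test_patch for one instance."""
    return set(re.findall(r"^diff --git a/\S+ b/(\S+)", gold_test_patch, flags=re.MULTILINE))


def keep_only_gold_test_files(model_patch: str, gold_test_patch: str) -> str:
    """Keep only per-file diff blocks whose path also appears in the gold test_patch."""
    allowed = gold_test_paths(gold_test_patch)
    kept = []
    for block in re.split(r"(?=^diff --git )", model_patch, flags=re.MULTILINE):
        m = re.match(r"diff --git a/\S+ b/(\S+)", block)
        if m and m.group(1) in allowed:
            kept.append(block)
    return "".join(kept)
```

Note this is stricter than the path whitelist: it would also drop new test files the agent created that the gold patch does not touch, which may or may not be desirable.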

Refs #708.


This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.


SWT-bench only scores diffs against existing test files. Today's
post-processing in eval_infer.py removes a small blacklist of setup
files (pyproject.toml, tox.ini, setup.py) but leaves everything else
alone. That means any source-code 'fix' the agent committed alongside
its test, plus scratch artifacts like reproduction.py, FIX_SUMMARY.md,
build/lib/*, and docs/ changes, all flow through into model_patch.

Source-code diffs are particularly harmful: when SWT-bench applies
model_patch and runs the new test, the test is now executing against
the agent's own 'fixed' code rather than the buggy code, so the F2P
signal is silenced and the instance is marked unresolved.

Empirically (run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463,
424 instances):

  Only test files in patch          : 22  instances, 10 resolved (45.5%)
  Mixed test + source code in patch : 332 instances,  4 resolved ( 1.2%)
  Only source code in patch         : 64  instances,  0 resolved ( 0.0%)
  Empty patch                       :  3  instances,  0 resolved ( 0.0%)
  Total                             : 424 instances, 14 resolved ( 3.3%)

This adds keep_only_test_files() in benchmarks/utils/patch_utils.py
that mirrors remove_files_from_patch() and applies the inverse
predicate: keep diffs whose target file lives under tests/, test/,
testing/, or whose basename matches test_*.py / *_test.py /
conftest.py. Repo-root paths are intentionally rejected so the agent's
own root-level scratch test_repro.py files are dropped too.

eval_infer.py applies this immediately after the existing setup-file
strip, so all existing SWT-bench runs can be rescored from
output.jsonl without re-running inference. See #708.

Co-authored-by: openhands <openhands@all-hands.dev>

Labels

build-swt-bench (Build SWT-Bench images based on SDK version on this PR), investigation