DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708) by juanmichelini · Pull Request #710 · OpenHands/benchmarks

juanmichelini · 2026-05-13T02:12:26Z

Addresses option 1 from #708.

What

Tightens the existing SWT-bench prompt at benchmarks/swtbench/prompts/default.j2 so the agent stops modifying source code and scratch files alongside its tests.

The previous default prompt was self-contradictory:

it said "DON'T have to modify the actual logic and ONLY have to update test logic and tests"…
…but then step 2 said "Create a script reproduction.py" and step 3 said "Edit the sourcecode of the repo to integrate your reproduction script into the test framework".

That contradiction accounts for much of the failure mode on qwen3-coder-next: #708 found that 78% of patches touch source or scratch files, and that the test-only subset solves at 45.5% while the mixed subset solves at only 1.2%.
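To make the size of the effect concrete, the figures from #708 can be combined into a rough blended solve rate. This is a back-of-the-envelope sketch, not a measured score: it assumes the 78% of patches that touch source or scratch files all fall in the "mixed" bucket and that the remaining 22% are test-only, which #708 may break down differently. The function name is hypothetical.

```python
def blended_solve_rate(mixed_share: float, test_only_rate: float, mixed_rate: float) -> float:
    """Weighted average of the two per-bucket solve rates.

    Assumption: every patch is either "mixed" or "test-only"; empty and
    source-only patches are ignored for simplicity.
    """
    return (1.0 - mixed_share) * test_only_rate + mixed_share * mixed_rate


# Figures quoted in this PR description (from #708).
current = blended_solve_rate(mixed_share=0.78, test_only_rate=0.455, mixed_rate=0.012)
print(f"{current:.1%}")  # prints 10.9%
```

Under these assumptions most of the headroom comes from shrinking the mixed share. The 45.5% test-only rate would not necessarily carry over to instances that are currently mixed, so this should be read as an illustration rather than a prediction.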

This PR edits default.j2 to:

add an IMPORTANT block stating exactly which paths are scored and why touching source files silences the F2P signal;
ban scratch files at the repo root (reproduction.py, test_repro.py, FIX_SUMMARY.md), and build/, docs/, pyproject.toml, etc.;
replace step 2 (create reproduction.py) with "use BashTool to confirm the buggy behavior";
replace step 3 (edit sourcecode) with "edit only existing test files inside the test directory";
add a final step: run git diff --name-only and revert anything outside the test directory.
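The final step in the list above (check `git diff --name-only` and revert anything outside the test directory) could look roughly like the sketch below if done programmatically. This is an illustration only: the prompt asks the agent to do this by hand, and the `is_test_path` rule (any parent directory named `tests` or `test`) is an assumption, not the harness's actual definition of a scored path.

```python
import subprocess
from pathlib import PurePosixPath


def is_test_path(path: str, test_dirs=("tests", "test")) -> bool:
    """Hypothetical rule: the file lives under a directory named like a test dir."""
    parents = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return any(part in test_dirs for part in parents)


def split_changed(paths):
    """Partition changed paths into (keep, revert)."""
    keep = [p for p in paths if is_test_path(p)]
    revert = [p for p in paths if not is_test_path(p)]
    return keep, revert


def changed_files(repo="."):
    """Tracked files with unstaged modifications, as reported by git."""
    result = subprocess.run(
        ["git", "diff", "--name-only"],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def revert_outside_tests(repo="."):
    """Restore every modified tracked file that is not under a test directory."""
    _, revert = split_changed(changed_files(repo))
    if revert:
        subprocess.run(["git", "checkout", "--", *revert], cwd=repo, check=True)
    return revert
```

Note that `git diff --name-only` lists only tracked files, so newly created scratch files such as `reproduction.py` would not show up here; those would need `git status --porcelain` or manual deletion.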

Diff is 13 added / 4 removed lines, no new files.

Scope

Single file changed: benchmarks/swtbench/prompts/default.j2.
Applies to every SWT-bench run that doesn't pass --prompt-path explicitly — including the SDK run-eval.yml, which doesn't thread --prompt-path through, so the change activates automatically once merged.

How to test

Re-run any SWT-bench config — no extra flags needed. Compare the resulting output.swtbench.jsonl patch-shape distribution to the baseline run linked in #708: the "mixed test + source" share should drop substantially. Score lift only materializes if the agent actually follows the instructions; if it doesn't, #711 (post-processing strip) is the safety net.
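One way to compute the patch-shape distribution mentioned above is sketched below. It assumes each line of `output.swtbench.jsonl` is a JSON object with a `model_patch` field holding a unified git diff (the field name is taken from the title of #711; the rest of the record layout is not shown in this PR), and it uses a simplified, hypothetical notion of "test file".

```python
import json
import re
from collections import Counter

# Matches the per-file header of a git-style unified diff.
# Simplification: paths containing spaces are not handled.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def files_in_patch(patch: str):
    """Return the post-image path of every file touched by the patch."""
    return [m.group(2) for m in DIFF_HEADER.finditer(patch)]


def is_test_file(path: str) -> bool:
    """Hypothetical rule: the file sits under a directory named tests/ or test/."""
    return any(part in ("tests", "test") for part in path.split("/")[:-1])


def patch_shape(patch: str) -> str:
    files = files_in_patch(patch)
    if not files:
        return "empty"
    flags = [is_test_file(f) for f in files]
    if all(flags):
        return "test-only"
    if any(flags):
        return "mixed"
    return "non-test-only"


def shape_distribution(jsonl_lines):
    """Count patch shapes over an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[patch_shape(record.get("model_patch") or "")] += 1
    return counts
```

Usage would be along the lines of `shape_distribution(open("output.swtbench.jsonl"))`, run once on the baseline output and once on the new run, then comparing the "mixed" share.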

Risks / caveats

This is a behavior nudge to every model running SWT-bench, not just qwen3-coder-next. Models that were already test-only are unaffected by the new rules. Models that produced "useful" source-code fixes for some other reason would lose them — but those edits aren't scored anyway, so the only downside is the agent might try harder to comply and waste more turns on test edits. Acceptable IMO; happy to revisit if a strong model regresses.
Wording is intentionally direct ("DO NOT", "earns zero credit"); if you'd rather keep the tone softer let me know and I'll soften it.
Does NOT address an agent that ignores the instructions entirely; that case is what #711 (strip non-test files from model_patch in post-processing, option 2 of #708) handles.

Refs #708. Supersedes the earlier version of this PR that added a separate qwen3_coder_next.j2 template.

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.

The default prompt told the agent it 'DON'T have to modify the actual logic', but in the same breath instructed it to 'Create a script reproduction.py' and 'Edit the sourcecode of the repo to integrate your reproduction script into the test framework'. Those last two steps are at odds with the real scoring rule (only diffs against existing test files count), and they explain much of the bad behavior seen on qwen3-coder-next: 78% of patches end up touching source code or scratch files, and the solve rate is 45.5% for test-only patches versus 1.2% for mixed ones.

This rewrites the prompt to:

- spell out exactly which paths are scored and why touching source files silences the F2P signal;
- ban scratch files at the repo root (reproduction.py, FIX_SUMMARY.md, root-level test_*.py) and ban build/, docs/, pyproject.toml etc.;
- replace the 'create reproduction.py + edit sourcecode' steps with 'run throwaway code via BashTool' + 'edit only existing test files inside the test directory';
- add a final 'git diff --name-only and revert anything outside the test directory' step.

Edits the default template directly so the change applies to every SWT-bench run (including the SDK run-eval workflow, which does not thread --prompt-path through).

See #708.

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini added the build-swt-bench and investigation labels May 13, 2026 (with OpenHands AI)

juanmichelini added the build-swt-bench label ("Build SWT-Bench images based on SDK version on this PR.") May 13, 2026

This was referenced May 13, 2026

DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708) #711

Draft

swtbench: qwen3-coder-next score is artificially low — agent writes source-code fix alongside the test (78% of patches) #708

Open

juanmichelini force-pushed the swtbench-qwen3-coder-next-prompt branch from ab5127f to 6e1dfbd Compare May 13, 2026 02:28

juanmichelini changed the title ~~DRAFT: swtbench: add stricter test-only prompt for qwen3-coder-next (option 1 of #708)~~ DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708) May 13, 2026

juanmichelini wants to merge 1 commit into main from swtbench-qwen3-coder-next-prompt


2 participants
