
DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708) #710

Draft
juanmichelini wants to merge 1 commit into main from swtbench-qwen3-coder-next-prompt

Collaborator

@juanmichelini juanmichelini commented May 13, 2026

Addresses option 1 from #708.

What

Tightens the existing SWT-bench prompt at benchmarks/swtbench/prompts/default.j2 so the agent stops modifying source code and scratch files alongside its tests.

The previous default prompt was self-contradictory:

  • it said "DON'T have to modify the actual logic and ONLY have to update test logic and tests"
  • …but then step 2 said "Create a script reproduction.py" and step 3 said "Edit the sourcecode of the repo to integrate your reproduction script into the test framework".

That contradiction accounts for much of the failure mode on qwen3-coder-next (#708 found that 78% of patches touch source or scratch files; the test-only subset solves at 45.5%, the mixed subset at 1.2%).
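
To make the stakes concrete, here is a back-of-envelope calculation using only the three figures quoted from #708. It assumes, purely for illustration, that every patch falls into exactly one of the two groups (test-only or mixed) and that each group's solve rate applies uniformly; the variable names are hypothetical and the result is an estimate, not a reported score.

```python
# Back-of-envelope only. Assumes the 78% / 22% split partitions all patches
# and that the per-group solve rates quoted from #708 apply uniformly.
share_non_test_only = 0.78  # patches touching source or scratch files (#708)
solve_test_only = 0.455     # solve rate of the test-only subset (#708)
solve_mixed = 0.012         # solve rate of the mixed subset (#708)

share_test_only = 1.0 - share_non_test_only
blended = share_test_only * solve_test_only + share_non_test_only * solve_mixed

print(f"implied overall solve rate: {blended:.1%}")
```

Under those assumptions the blended rate comes out near 11%, far below the 45.5% seen on test-only patches, which is why shifting patch shape matters more than any other single change.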

This PR edits default.j2 to:

  • add an IMPORTANT block stating exactly which paths are scored and why touching source files silences the F2P signal;
  • ban scratch files at the repo root (reproduction.py, test_repro.py, FIX_SUMMARY.md) as well as edits to build/, docs/, pyproject.toml, etc.;
  • replace step 2 (create reproduction.py) with "use BashTool to confirm the buggy behavior";
  • replace step 3 (edit sourcecode) with "edit only existing test files inside the test directory";
  • add a final step: run git diff --name-only and revert anything outside the test directory.
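
The final step in the list above can be sketched in Python as follows. This is a minimal illustration, not the prompt's actual wording or any code from this PR: the helper names `changed_paths` and `paths_outside` are hypothetical, and the test directory (`tests` in the usage note) is an assumption that varies per repository.

```python
import subprocess
from pathlib import PurePosixPath


def paths_outside(changed: list[str], test_dir: str) -> list[str]:
    """Return the changed paths that do not live under test_dir."""
    root = PurePosixPath(test_dir)
    return [p for p in changed if root not in PurePosixPath(p).parents]


def changed_paths(repo: str = ".") -> list[str]:
    """List tracked files with unstaged modifications, via git diff --name-only."""
    out = subprocess.run(
        ["git", "diff", "--name-only"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

Calling `paths_outside(changed_paths(), "tests")` would list the candidates to revert. Note that `git diff --name-only` does not report untracked files, so a newly created scratch file such as reproduction.py only shows up once it is added to the index or listed with `git status --porcelain`.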

Diff is 13 added / 4 removed lines, no new files.

Scope

  • Single file changed: benchmarks/swtbench/prompts/default.j2.
  • Applies to every SWT-bench run that doesn't pass --prompt-path explicitly — including the SDK run-eval.yml, which doesn't thread --prompt-path through, so the change activates automatically once merged.

How to test

Re-run any SWT-bench config — no extra flags needed. Compare the resulting output.swtbench.jsonl patch-shape distribution to the baseline run linked in #708: the "mixed test + source" share should drop substantially. Score lift only materializes if the agent actually follows the instructions; if it doesn't, #711 (post-processing strip) is the safety net.
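
One way to compute the patch-shape distribution is sketched below. It rests on several assumptions: that each line of output.swtbench.jsonl is a JSON object with a `model_patch` field holding a unified git diff (the field name is taken from the title of #711), that file paths contain no spaces, and that "test file" means any file under a directory named `test` or `tests`. The function names and the classification heuristic are hypothetical, not the method used in #708.

```python
import json
import re
from collections import Counter

# Matches the per-file header lines that git writes into a unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def is_test_path(path: str) -> bool:
    """Hypothetical heuristic: the file sits under a test/ or tests/ directory."""
    return any(part in ("test", "tests") for part in path.split("/")[:-1])


def patch_shape(patch: str) -> str:
    """Classify a patch as empty, test-only, mixed, or non-test-only."""
    paths = [m.group(2) for m in DIFF_HEADER.finditer(patch)]
    if not paths:
        return "empty"
    flags = [is_test_path(p) for p in paths]
    if all(flags):
        return "test-only"
    return "mixed" if any(flags) else "non-test-only"


def shape_distribution(jsonl_lines) -> Counter:
    """Count patch shapes over an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            counts[patch_shape(record.get("model_patch") or "")] += 1
    return counts
```

Passing an open file handle for the baseline run and for the new run to `shape_distribution` gives two counters that can be compared directly; a root-level file such as test_repro.py counts as non-test under this heuristic, matching the ban on root-level scratch files.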

Risks / caveats

  • This is a behavior nudge for every model running SWT-bench, not just qwen3-coder-next. Models that were already test-only are unaffected by the new rules. Models that produced "useful" source-code fixes for some other reason would lose them, but those edits aren't scored anyway. The only downside is that an agent might work harder to comply and spend more turns on test edits. Acceptable IMO; happy to revisit if a strong model regresses.
  • Wording is intentionally direct ("DO NOT", "earns zero credit"); if you'd prefer a softer tone, let me know and I'll adjust it.
  • Does NOT address an agent that ignores instructions entirely; that case is handled by #711 (DRAFT: swtbench: strip non-test files from model_patch in post-processing, option 2 of #708).

Refs #708. Supersedes the earlier version of this PR that added a separate qwen3_coder_next.j2 template.


This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.

The default prompt told the agent it 'DON'T have to modify the actual
logic' but in the same breath instructed it to 'Create a script
reproduction.py' and 'Edit the sourcecode of the repo to integrate
your reproduction script into the test framework'. Those last two
steps conflict with the real scoring rule (only diffs against
existing test files count) and explain much of the bad behavior
seen on qwen3-coder-next: 78% of patches end up touching source code
or scratch files, and the solve rate drops from 45.5% (test-only) to
1.2% (mixed).

This rewrites the prompt to:
- spell out exactly which paths are scored and why touching source
  files silences the F2P signal;
- ban scratch files at the repo root (reproduction.py, FIX_SUMMARY.md,
  root-level test_*.py) and ban build/, docs/, pyproject.toml etc.;
- replace the 'create reproduction.py + edit sourcecode' steps with
  'run throwaway code via BashTool' + 'edit only existing test files
  inside the test directory';
- add a final 'git diff --name-only and revert anything outside the
  test directory' step.

Edits the default template directly so the change applies to every
SWT-bench run (including the SDK run-eval workflow, which does not
thread --prompt-path through). See #708.

Co-authored-by: openhands <openhands@all-hands.dev>
juanmichelini force-pushed the swtbench-qwen3-coder-next-prompt branch from ab5127f to 6e1dfbd, May 13, 2026 02:28
juanmichelini changed the title from "DRAFT: swtbench: add stricter test-only prompt for qwen3-coder-next (option 1 of #708)" to "DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708)", May 13, 2026

Labels

build-swt-bench: Build SWT-Bench images based on SDK version on this PR.
investigation
