
feat: ProgramBench reverse-engineering environment #1351

Draft
sethkarten wants to merge 4 commits into main from feat/programbench-env


sethkarten commented May 12, 2026

Summary

  • Adds environments/programbench/ — a verifiers.v1 RLM environment where an agent reconstructs source code from compiled binaries, scored by fraction of pytest tests passed
  • End-to-end pipeline validated: setup → agent turn → scoring with real test execution
  • Smoke test result: 17/36 tests passed (reward=0.472) on antonmedv__fx.86d0d34 with gpt-4.1-mini
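
For reference, the reported reward is consistent with a plain passed/total ratio. The sketch below is only an illustration of that arithmetic; the function name `fraction_passed` and the zero-test fallback are assumptions, not code from this PR.

```python
def fraction_passed(passed: int, total: int) -> float:
    """Reward as the fraction of tests that passed.

    Hypothetical helper: returning 0.0 when no tests were collected is an
    assumption, not behaviour confirmed by the PR.
    """
    if total <= 0:
        return 0.0
    return passed / total


# Smoke test figures quoted above: 17 of 36 tests passed.
print(round(fraction_passed(17, 36), 3))  # 0.472
```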

Key design decisions

  • Language images: golang:1.22-bookworm, gcc:13-bookworm, custom Rust image with pre-warmed cargo registry
  • pytest is installed via pip --break-system-packages after apt-get update (required on Debian 12, which enforces PEP 668)
  • Test archives hidden from agent during rollout, restored at scoring time (follows rlm_swe_v1 pattern)
  • Scoring always re-runs compile.sh with the toolchain binary directories prepended to PATH, so agents that omit the PATH export are still scored correctly
  • Dataset: PrimeIntellect/programbench-processed (private HF); 5-task Go smoke subset in data/go_subset.jsonl
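
One way the PATH-prepending step could look is sketched below. This is a minimal illustration under stated assumptions: `env_with_toolchain` and `run_compile` are hypothetical names, the timeout is arbitrary, and the actual environment may run compile.sh through a sandbox API rather than a local subprocess.

```python
import os
import subprocess


def env_with_toolchain(toolchain_dirs, base_env=None):
    """Return a copy of the environment with toolchain dirs first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PATH", "")
    parts = list(toolchain_dirs) + ([existing] if existing else [])
    env["PATH"] = os.pathsep.join(parts)
    return env


def run_compile(workdir, toolchain_dirs, timeout=600):
    """Re-run compile.sh in workdir with the patched PATH (hypothetical helper)."""
    return subprocess.run(
        ["bash", "compile.sh"],
        cwd=workdir,
        env=env_with_toolchain(toolchain_dirs),
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Example: /usr/local/go/bin is where the official golang images place the toolchain.
patched = env_with_toolchain(["/usr/local/go/bin"], {"PATH": "/usr/bin"})
print(patched["PATH"])
```

Prepending rather than appending means the image's toolchain takes precedence over anything else on PATH, regardless of what the agent's own shell session exported.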

Test plan

  • prime env install programbench succeeds
  • vf-eval programbench -a '{"filter_language":"go","max_tasks":1}' -n 1 -r 1 -d -v returns non-zero reward
  • Full 5-task Go run: vf-eval programbench -a '{"filter_language":"go"}' -n 5 -r 1
  • Binary preprocessing on Linux (scripts/build_binaries_linux.sh) to enable agent probing
  • Rust custom Docker image build + push (docker/build.sh)
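
The `filter_language` and `max_tasks` arguments used in the commands above could be applied to a JSONL subset roughly as follows. This is a sketch only: the `language` and `task_id` field names and the `select_tasks` helper are assumptions for illustration, not the schema or loader in this PR.

```python
import json


def select_tasks(jsonl_lines, filter_language=None, max_tasks=None):
    """Parse JSONL task records, then filter by language and cap the count."""
    tasks = [json.loads(line) for line in jsonl_lines if line.strip()]
    if filter_language is not None:
        # "language" is an assumed field name.
        tasks = [t for t in tasks if t.get("language") == filter_language]
    if max_tasks is not None:
        tasks = tasks[:max_tasks]
    return tasks


# Made-up records standing in for lines of a file such as data/go_subset.jsonl.
sample = [
    '{"task_id": "a", "language": "go"}',
    '{"task_id": "b", "language": "rust"}',
    '{"task_id": "c", "language": "go"}',
]
print(select_tasks(sample, filter_language="go", max_tasks=1))
```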

Pending follow-up

  • Linux binary build pipeline (populates empty binary_hf_repo/binary_hf_filename fields)
  • Rust toolchain image push to DockerHub
  • Full 200-task eval once binaries are available

🤖 Generated with Claude Code

Seth and others added 4 commits May 11, 2026 17:46
Verifiers.v1 environment where an RLM agent reconstructs source code
from compiled binaries, scored by fraction of pytest tests passed.

Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard langs; custom
  Rust image with pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on
  Debian 12 (PEP 668 restriction)
- Test archives hidden from agent at setup; uploaded at scoring time
  following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries
  so agents that omit PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; full dataset on
  PrimeIntellect/programbench-processed (HF private)

Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on
antonmedv__fx.86d0d34, confirming end-to-end pipeline works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>