feat: ProgramBench reverse-engineering environment by sethkarten · Pull Request #1351 · PrimeIntellect-ai/verifiers

sethkarten · 2026-05-12T00:47:17Z

Summary

Adds environments/programbench/ — a verifiers.v1 RLM environment where an agent reconstructs source code from compiled binaries, scored by fraction of pytest tests passed
End-to-end pipeline validated: setup → agent turn → scoring with real test execution
Smoke test result: 17/36 tests passed (reward=0.472) on antonmedv__fx.86d0d34 with gpt-4.1-mini
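
As a rough illustration of the scoring rule described above (reward as the fraction of pytest tests passed), a minimal sketch follows. The function name `fraction_passed` and the zero-test fallback are assumptions made for illustration, not the environment's actual code.

```python
def fraction_passed(passed: int, total: int) -> float:
    """Return the fraction of tests passed.

    Assumption: a task whose test suite collects no tests scores 0.0
    instead of raising a division error.
    """
    if total <= 0:
        return 0.0
    if not 0 <= passed <= total:
        raise ValueError(f"passed={passed} is outside the range 0..{total}")
    return passed / total


# Smoke-test numbers reported in this PR: 17 of 36 tests passed.
reward = fraction_passed(17, 36)
print(f"reward={reward:.3f}")  # prints reward=0.472
```

The reported reward of 0.472 is 17/36 rounded to three decimal places.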

Key design decisions

Language images: golang:1.22-bookworm, gcc:13-bookworm, custom Rust image with pre-warmed cargo registry
pytest installed via pip --break-system-packages after apt-get update (Debian 12 / PEP 668 requirement)
Test archives hidden from agent during rollout, restored at scoring time (follows rlm_swe_v1 pattern)
Scoring always re-runs compile.sh with PATH prepended so agents that omit PATH export still get correct scores
Dataset: PrimeIntellect/programbench-processed (private HF); 5-task Go smoke subset in data/go_subset.jsonl
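
The PATH-prepending decision above could look roughly like the sketch below. It runs `compile.sh` as a local subprocess for simplicity, whereas the real environment executes inside a sandbox. The helper names, the timeout, and the toolchain directory `/usr/local/go/bin` are assumptions for illustration only.

```python
import os
import subprocess


def env_with_prepended_path(extra_dirs, base_env=None):
    """Return a copy of the environment with extra_dirs placed first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PATH", "")
    parts = list(extra_dirs) + ([current] if current else [])
    env["PATH"] = os.pathsep.join(parts)
    return env


def run_compile(workdir, toolchain_dirs, timeout_s=600):
    """Re-run compile.sh with toolchain directories prepended to PATH.

    Hypothetical helper: the toolchain is found even when the agent's
    script never exported PATH itself.
    """
    return subprocess.run(
        ["bash", "compile.sh"],
        cwd=workdir,
        env=env_with_prepended_path(toolchain_dirs),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


# Example: put an assumed Go toolchain directory ahead of the existing PATH.
demo_env = env_with_prepended_path(["/usr/local/go/bin"], {"PATH": "/usr/bin"})
print(demo_env["PATH"])
```

Prepending rather than replacing PATH keeps the system tools reachable while giving the language toolchain priority.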

Test plan

prime env install programbench succeeds
vf-eval programbench -a '{"filter_language":"go","max_tasks":1}' -n 1 -r 1 -d -v returns non-zero reward
Full 5-task Go run: vf-eval programbench -a '{"filter_language":"go"}' -n 5 -r 1
Binary preprocessing on Linux (scripts/build_binaries_linux.sh) to enable agent probing
Rust custom Docker image build + push (docker/build.sh)
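
The `filter_language` and `max_tasks` arguments used in the commands above suggest a selection step over the task list. The sketch below shows one way such filtering might work on JSONL records. The record fields `task_id` and `language`, the function name `select_tasks`, and the second sample task are assumptions; the actual schema of data/go_subset.jsonl is not shown in this PR.

```python
import json


def select_tasks(jsonl_text, filter_language=None, max_tasks=None):
    """Parse JSONL text and keep tasks matching the language, up to max_tasks."""
    tasks = []
    for line in jsonl_text.splitlines():
        if max_tasks is not None and len(tasks) >= max_tasks:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        task = json.loads(line)
        # Assumed field name: "language".
        if filter_language is not None and task.get("language") != filter_language:
            continue
        tasks.append(task)
    return tasks


# Sample records; the second task id is hypothetical.
sample = "\n".join([
    json.dumps({"task_id": "antonmedv__fx.86d0d34", "language": "go"}),
    json.dumps({"task_id": "example__c_task.0000000", "language": "c"}),
])

selected = select_tasks(sample, filter_language="go", max_tasks=1)
print([t["task_id"] for t in selected])  # prints ['antonmedv__fx.86d0d34']
```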

Pending follow-up

Linux binary build pipeline (populates empty binary_hf_repo/binary_hf_filename fields)
Rust toolchain image push to DockerHub
Full 200-task eval once binaries are available

🤖 Generated with Claude Code

Verifiers.v1 environment where an RLM agent reconstructs source code from compiled binaries, scored by fraction of pytest tests passed.

Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard langs; custom Rust image with pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on Debian 12 (PEP 668 restriction)
- Test archives hidden from agent at setup; uploaded at scoring time following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries so agents that omit PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; full dataset on PrimeIntellect/programbench-processed (HF private)

Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on antonmedv__fx.86d0d34, confirming end-to-end pipeline works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Seth and others added 4 commits May 11, 2026 17:46

feat(programbench): add multi-language preprocess + build scripts

1dddfeb

fix(programbench): add language overrides for misclassified repos (jp…

7a566db

…2a, fasttext, brotli, halite, blake3)

feat(programbench): Python-based multi-language binary builder (repla…

1ffbf1d

…ces fragile shell heredocs)

sethkarten wants to merge 4 commits into main from feat/programbench-env
