Turn any repository into an RL environment for training and evaluation.
⚠️ Experimental. This is a research project in active development. APIs, spec fields, and CLI flags change between minor versions. Pin to a specific release if you depend on it; expect breaking changes on `main`.
Repo2RLEnv synthesizes verifiable data from existing repositories using pluggable pipelines, exports it into a uniform spec, and pushes directly to the Hugging Face Hub. It is end-to-end — synthesize → standardize → train + eval — with the focus on training. The uniform spec is Harbor's, so the datasets you produce drop straight into any Harbor-compatible runtime.
╭──────────────╮ ╭──────────────╮ ╭──────────────╮ ╭──────────────────╮
│ any │ ──▶ │ synthesize │ ──▶ │ uniform spec │ ──▶ │ train · eval · │
│ repo │ │ (pipelines) │ │ (Harbor) │ │ push to HF Hub │
╰──────────────╯ ╰──────────────╯ ╰──────────────╯ ╰──────────────────╯
└──────────────────────── Repo2RLEnv ────────────────────────┘
```sh
# Install (pick one)
uv add repo2rlenv        # add to a uv-managed project
uvx repo2rlenv --help    # one-shot, no install
pip install repo2rlenv   # classic

# Auth: nothing to set up if you've done `gh auth login` and `huggingface-cli login`
# Otherwise: export GITHUB_TOKEN=... ; export HF_TOKEN=...
```
```sh
# Generate a dataset locally
repo2rlenv generate \
  --repo <owner>/<repo> \
  --pipeline pr_diff \
  --pipeline-opt limit=5 \
  --llm anthropic/claude-sonnet-4-6 \
  --out ./datasets/<dataset-name>
# Or push straight to HF Hub with --out hf://<your-org>/<dataset-name>

# Validate a local dataset against the spec
repo2rlenv validate ./path/to/dataset

# Score a candidate diff against a task's oracle (diff-similarity reward)
repo2rlenv reward --task ./datasets/<dataset-name>/<task-id> --prediction ./candidate.diff
```
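A candidate diff is typically whatever an agent left in its working tree; a minimal sketch, with illustrative paths and task ID:

```sh
# Capture the agent's uncommitted changes as a unified diff, then score it
git -C ./agent-workdir diff > candidate.diff
repo2rlenv reward --task ./datasets/my-dataset/task-0001 --prediction ./candidate.diff
```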
```sh
# Or write a sample config first and use --config
repo2rlenv init && repo2rlenv generate --config repo2rlenv.config.yaml
```

Full walkthrough in docs/quickstart.md.
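For orientation, the generated config plausibly mirrors the CLI flags above. The keys below are assumptions, not the authoritative schema; use whatever `repo2rlenv init` actually writes:

```yaml
# repo2rlenv.config.yaml: illustrative sketch; key names are assumptions
repo: owner/repo
pipeline: pr_diff
pipeline_opts:
  limit: 5
llm: anthropic/claude-sonnet-4-6
out: ./datasets/my-dataset   # or hf://your-org/my-dataset
```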
Different methods to manufacture verifiable tasks from a repo. Pick one, run it, push the dataset.
| Pipeline | What it does | Sandbox | LLM | Supported languages | Inspiration | Docs |
|---|---|---|---|---|---|---|
| `pr_diff` | Mine merged PR diffs (text only, no execution) | — | — | any | SWE-RL | 📄 |
| `pr_runtime` | Mine merged PRs; sandbox-verify F2P/P2P oracle | ✅ | ✅ | Py · Node · Go · Rust | SWE-bench | 📄 |
| `pr_stream` | Continuous PR mining (watermark-based, monthly cron) | ✅ | ✅ | Py · Node · Go · Rust | SWE-bench-Live | 📄 |
| `commit_runtime` | Commit-level mining (bypass PR-review filters) | ✅ | ✅ | Py · Node · Go · Rust | R2E-Gym / SWE-GEN | 📄 |
| `mutation_bugs` | Inject bugs via AST mutations; tests must break | ✅ | ✅ | Py only | SWE-smith | 📄 |
| `code_instruct` | Repo-anchored OSS-Instruct with executable verifiers | ✅ | ✅ | Py only | Magicoder / OSS-Instruct | 📄 |
| `equivalence_tests` | Extract a function; LLM writes equivalence tests | ✅ | ✅ | Py only | R2E | 📄 |
| `cve_patches` | Map OSV CVEs to fix commits in the target repo | ✅ | ✅ | Py · Node · Go · Rust | PatchSeeker / CVE-Bench | 📄 |
| `refactor_synthesis` | Mine rename refactors from commit history | ✅ | ✅ | Py only | Python-native (drops RefactoringMiner JVM dep) | 📄 |
Python repos exercise all 9 pipelines; other supported languages exercise the 5 language-agnostic ones. Polyglot mutation + non-Python synthesis are on the v0.9 roadmap.
Every pipeline flows through the same QA gate (determinism, oracle consistency, LLM judge, false-negative filter) before tasks are admitted to a dataset. Text-only pipelines skip the heavy QA layers since there's no execution to validate. See docs/pipelines/README.md for reward kinds + GPU requirements.
Pipelines marked with a sandbox ✅ above need a working Docker environment for the target repo before they can run. Repo2RLEnv's bootstrap phase handles this automatically — an LLM agent iterates shell commands inside a fresh Docker container until the repo builds and the test suite collects. The working image is committed, content-addressed, and cached, so the expensive env-construction step runs once per (repo, ref) and every downstream task reuses it. Pure text pipelines (pr_diff) skip it entirely.
You don't normally invoke it directly — repo2rlenv generate --pipeline pr_runtime ... auto-triggers a cache lookup and runs bootstrap on miss. But you can pre-warm it or use it standalone for debugging:
```sh
repo2rlenv bootstrap \
  --repo <owner>/<repo> \
  --llm anthropic/claude-sonnet-4-6
```

Full design + cache layout + cost-tracking + spec extension fields: docs/reference/BOOTSTRAP.md.
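For example, pre-warm once and let every later run hit the cache. `pallets/click` is used purely for illustration (it is one of the repos the roadmap reports Harbor-verified runs on):

```sh
# First command builds and caches the env image for this (repo, ref)
repo2rlenv bootstrap --repo pallets/click --llm anthropic/claude-sonnet-4-6
# Second command finds the cached image and skips bootstrap entirely
repo2rlenv generate --repo pallets/click --pipeline pr_runtime \
  --llm anthropic/claude-sonnet-4-6 --out ./datasets/click-pr-runtime
```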
A dataset format that:

- Is verifiable — every task carries either an executable test (`test_execution`) or a stored oracle diff (`diff_similarity`); your trainer picks the reward type
- Is content-addressed — `content_hash` over each task; same artifacts ⇒ same hash
- Trains anywhere via Harbor — TRL, SkyRL, Prime-RL, Tinker, Miles, Slime, `harbor.rl`
- Evaluates with any agent harness — Claude Code, OpenHands, Codex CLI, Gemini CLI, …
- Is language-agnostic by spec — `_runtime` pipelines emit a Dockerfile + shell verifier; `_diff` pipelines are pure text and work for any language with no extra config
- Publishes natively to Hugging Face Hub — `--out hf://owner/name` writes a Harbor-compatible `registry.json` so consumers can `harbor download` without any glue (see the sketch after this list)
- Supports private repos end-to-end — `gh auth token` resolved automatically; build secrets declared by name; verifier-time secrets forbidden by spec
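A minimal publish-then-consume sketch: the `repo2rlenv` flags come from the quickstart above, while the exact `harbor download` invocation is an assumption; defer to Harbor's docs:

```sh
# Publish: writes the tasks plus a Harbor-compatible registry.json to the Hub
repo2rlenv generate --repo pallets/click --pipeline pr_diff --out hf://my-org/click-prs
# Consume from the Harbor side (invocation shape is an assumption)
harbor download my-org/click-prs
```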
Repo2RLEnv emits datasets in the Harbor task format. We don't ship our own sandbox, agent harness, or registry — Harbor already has those. We focus on synthesis: turning a real repo into verifiable, reproducible Harbor tasks. A small [metadata.repo2env] extension inside Harbor's task.toml carries provenance (pipeline name, base commit, PR URL, content hash, reward kinds, etc.).
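Illustratively, that extension might carry fields like the sketch below. The field names are assumptions inferred from the provenance list above; the authoritative contract is reference/SPEC.md:

```toml
# task.toml excerpt: field names are assumptions; values are illustrative
[metadata.repo2env]
pipeline = "pr_runtime"
base_commit = "abc1234def"
pr_url = "https://github.com/owner/repo/pull/42"
content_hash = "sha256:…"
reward_kinds = ["test_execution", "diff_similarity"]
```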
By targeting Harbor we inherit its full stack: Local Docker / Modal / Daytona / E2B / Runloop sandboxes, every major coding-agent harness, parallel execution, the publishing CLI, and downstream hooks for OpenReward (which adds Miles, Slime to the trainer list).
Docs are organized into three tiers — see docs/README.md for the index.
- 🚀 `docs/quickstart.md` — install → first dataset → push to Hub, in 10 minutes
- 📖 `docs/pipelines/` — one page per synthesis pipeline (when to use, oracle shape, inspiration)
- 📚 Reference contracts and module-level API:
  - `reference/SPEC.md` — input/output contract
  - `reference/API.md` — Python API for `src/repo2rlenv/`
  - `reference/AUTH.md` — GitHub / HF / LLM auth resolution
  - `reference/BOOTSTRAP.md` — LLM-iterated per-repo Docker image
  - `reference/AGENTS.md` — Harbor agent harnesses + RL trace plumbing
- 🛠 `CONTRIBUTING.md` — dev setup, PR conventions, commit style, release flow
- 🧪 `contributing/ADDING_A_PIPELINE.md` — step-by-step cookbook for shipping a new pipeline
Beyond the per-pipeline inspirations linked in the table above, Repo2RLEnv builds on or adjacent to:
- Harbor — the task format + runtime ecosystem we adopt as our output spec
- RepoLaunch (Microsoft) — LLM-agent-driven environment setup; our `bootstrap` is an independent reimplementation
- OpenReward — ORS protocol + extra trainer integrations layered above Harbor
- SWE-Gym — RL-environment framing for SWE-bench-style tasks
- SWE-Bench++ — four-stage QA pipeline we'll re-implement
- verifiers (Prime Intellect), OpenEnv (Meta + HF) — adjacent standardization efforts
Every pipeline that draws from external work carries an Acknowledgment block in its .py file. No code is copied — implementations are independent and licensed Apache-2.0.
Pre-alpha.
- v0.1.0 shipped on PyPI: `pr_diff` + HF Hub publish + diff-similarity reward, end-to-end on any GitHub repo (public or private).
- v0.2: bootstrap phase (LLM-driven Docker env), unified Rich UI, content-addressed cache, registry-qualified pullable digests. (rolled into v0.3 release)
- v0.3.0 shipped on PyPI: `pr_runtime` pipeline (sandbox-verified PR mining with `FAIL_TO_PASS`/`PASS_TO_PASS` oracle), auto-triggered bootstrap, structural quality filters, targeted test invocation.
- v0.4.0 shipped on PyPI: polyglot log parsers (Go / Cargo / Jest), Harbor end-to-end verification (mean reward 1.0 on Go via `urfave/cli`).
- v0.5: `pr_stream` (continuous PR mining, watermark-based) + `commit_runtime` (commit-level mining, SWE-GEN style); defensive git install in emitted Dockerfile so any bootstrap base image works. Harbor-verified on both. (rolled into v0.6 release)
- v0.6.0 shipped on PyPI: first LLM-synthesized pipelines — `mutation_bugs` (AST-based bug injection inspired by SWE-smith) + `code_instruct` (repo-anchored OSS-Instruct inspired by Magicoder, with executable verifiers). Harbor-verified on `pallets/click` (mean reward 1.000 on both). 271/271 tests passing.
- v0.7.0 shipped on PyPI: `equivalence_tests` (R2E-style function-level synthesis — extract a real function, LLM writes equivalence tests, gold patch fills in the candidate with the original) + `cve_patches` (OSV-driven security-fix mining — CVE → fix commit → Harbor task). Harbor-verified on `pallets/click` and `pallets/werkzeug` (mean reward 1.000 on both).
- v0.8.0 shipped on PyPI: `refactor_synthesis` (Python-native rename-refactor mining — drops the JVM RefactoringMiner dep; commit-message regex + diff verification + multi-criteria verifier). Harbor-verified on `pallets/click` (mean reward 1.000). All 8 originally-planned pipelines now shipped.
- v0.9 planned: LLM-judged QA gate (SWE-Bench++ four-layer recipe) + iterative refinement for `equivalence_tests` + LLM-synthesized PoC tests for `cve_patches` + HF Hub append-mode for `pr_stream` + polyglot mutation (Java/JS/Go via tree-sitter) + Extract Method / Inline Function refactor kinds for `refactor_synthesis`.
Apache 2.0 — see LICENSE.