A scaffold for agent-focused benchmarking: SWE-bench Multilingual first, with additional benchmark suites to follow.
- SWE-bench Multilingual integration scaffold
- Rust-first task filtering
- Agent adapters (Claude Code, Cursor, noop)
- Run artifacts and metadata for reproducibility
Complete the following setup before running `./scripts/swebench/phase1_tokio_run.sh`.
- System prerequisites:
  - Python >= 3.11
  - git
  - Docker Desktop / Docker daemon running
  - `claude` CLI installed and authenticated
  - `cursor-agent` CLI installed and authenticated
- Create and activate a Python virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  python -m pip install --upgrade pip
  ```

- Install Python dependencies:

  ```bash
  python -m pip install -e '.[hf]'
  python -m pip install swebench
  ```

- Optional preflight checks:

  ```bash
  python -c "import datasets, swebench; print('python deps ok')"
  command -v claude
  command -v cursor-agent
  docker info
  ```

- Run Phase 1:

  ```bash
  ./scripts/swebench/phase1_tokio_run.sh
  ```

By default, Phase 1 now runs the Claude and Cursor baselines in parallel and runs up to
2 tasks concurrently per baseline (`RUN_MAX_WORKERS=2`).
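The per-baseline task concurrency described above can be sketched with a worker pool. This is a minimal illustration of the `RUN_MAX_WORKERS` semantics, not the repo's actual runner; `run_task` is a hypothetical stand-in:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_task(instance_id: str) -> str:
    # Stand-in for launching one SWE-bench task attempt.
    return f"{instance_id}: done"

def run_baseline(instance_ids, max_workers=None):
    # Default mirrors RUN_MAX_WORKERS=2: at most two tasks in flight per baseline.
    if max_workers is None:
        max_workers = int(os.environ.get("RUN_MAX_WORKERS", "2"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order in its results.
        return list(pool.map(run_task, instance_ids))
```

Running the two baselines in parallel would simply wrap two such pools in an outer executor.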
If needed, you can force the script to use a specific interpreter:

```bash
PYTHON_BIN=python ./scripts/swebench/phase1_tokio_run.sh
```

Tuning parallelism:

```bash
# Disable parallel Claude/Cursor baseline runs
RUN_AGENTS_IN_PARALLEL=0 ./scripts/swebench/phase1_tokio_run.sh

# Keep agents parallel, but change per-run task concurrency
RUN_MAX_WORKERS=3 ./scripts/swebench/phase1_tokio_run.sh
```

- Pick a config:
  - `configs/swebench/rust_canary.toml` (mock agent)
  - `configs/swebench/rust_claude_code.toml` (Claude Code wrapper)
  - `configs/swebench/rust_cursor.toml` (Cursor wrapper)
  - `configs/swebench/rust_tokio_phase1_claude.toml` (Tokio Phase 1, Claude)
  - `configs/swebench/rust_tokio_phase1_cursor.toml` (Tokio Phase 1, Cursor)
- Export SWE-bench Multilingual data into local JSONL:

  ```bash
  python3 -m benchkit.swebench.cli export-hf \
    --split test \
    --repo tokio-rs/tokio \
    --max-instances 3 \
    --output datasets/swebench_multilingual.test.tokio.jsonl \
    --overwrite
  ```

  Set `HF_TOKEN` if the dataset requires authenticated access.
  The exporter normalizes language aliases (for example, `tokio-rs` -> `rust`) and
  adds `repo_label` (the owner/org prefix from `repo`) to each row.
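The normalization step described above might look roughly like this. The alias table contents are assumed for illustration; only the `tokio-rs -> rust` example and the `repo_label` rule come from the docs:

```python
# Hypothetical sketch of the exporter's per-row normalization.
LANGUAGE_ALIASES = {
    "tokio-rs": "rust",  # example alias from the docs; other entries are assumed
    "rs": "rust",
}

def normalize_row(row: dict) -> dict:
    out = dict(row)
    lang = row.get("language", "")
    out["language"] = LANGUAGE_ALIASES.get(lang, lang)
    # repo_label is the owner/org prefix of an "owner/name" repo string.
    out["repo_label"] = row.get("repo", "").split("/")[0]
    return out
```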
- Inspect a planned run:

  ```bash
  python3 -m benchkit.swebench.cli plan --config configs/swebench/rust_canary.toml
  ```

- Run a dry-run baseline (no real agent call):

  ```bash
  python3 -m benchkit.swebench.cli run --config configs/swebench/rust_canary.toml --dry-run
  ```

  Run outputs are written under `runs/`.
Enable direct SWE-bench harness evaluation after prediction generation with:

```toml
[evaluation]
enabled = true
python_bin = "python3"
swebench_repo = "/absolute/path/to/SWE-bench"  # optional if swebench is installed
dataset_name = "SWE-bench/SWE-bench_Multilingual"
split = "dev"
max_workers = 4
timeout_seconds = 7200
```

Each attempt will write:

- `attempts/attempt-XX/evaluation.json`
- `attempts/attempt-XX/evaluation.stdout.log`
- `attempts/attempt-XX/evaluation.stderr.log`
- `attempts/attempt-XX/agent:<...>.<run_id>-attempt-XX.json` (raw SWE-bench report)
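Given that layout, collecting the per-attempt evaluation reports from a run root is a few lines of path globbing. A minimal sketch that assumes only the directory layout above, not any particular JSON schema:

```python
import json
from pathlib import Path

def collect_evaluation_reports(run_root):
    """Map attempt directory name -> parsed evaluation.json contents."""
    reports = {}
    for path in sorted(Path(run_root).glob("attempts/attempt-*/evaluation.json")):
        reports[path.parent.name] = json.loads(path.read_text())
    return reports
```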
To run against real repo checkouts at each task's `base_commit`, enable in `[run]`:

```toml
condition = "baseline"
include_repos = ["tokio-rs/tokio"]  # optional
include_instance_ids = []           # optional
# instance_ids_file = "tokio_task_ids.txt"
prepare_workspace = true
repo_url_template = "https://github.com/{repo}.git"
git_bin = "git"
workspace_timeout_seconds = 600
max_workers = 2
```

Generate appendix files from one or more completed runs:
```bash
python3 -m benchkit.swebench.cli appendix \
  --run-root runs/swebench_multilingual/<date>/<run_id_1> \
  --run-root runs/swebench_multilingual/<date>/<run_id_2> \
  --output-dir reports/appendix
```

Phase 1 (2-3 Tokio tasks) flow:
- Copy `configs/swebench/tokio_task_ids.sample.txt` to `configs/swebench/tokio_task_ids.txt` and keep only your selected task IDs.
- Run `plan` with one of the `rust_tokio_phase1_*.toml` configs.
- Run `run` for Claude and Cursor baselines.
- Run `appendix` on both run roots to generate appendix files.
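The `include_repos`/`include_instance_ids` filters from the `[run]` config could be applied with logic along these lines. This is a sketch; the field names mirror the config keys, but the function itself is hypothetical:

```python
def select_tasks(tasks, include_repos=None, include_instance_ids=None):
    # Empty or missing filters mean "no restriction", like the optional config keys.
    selected = []
    for task in tasks:
        if include_repos and task["repo"] not in include_repos:
            continue
        if include_instance_ids and task["instance_id"] not in include_instance_ids:
            continue
        selected.append(task)
    return selected
```

An `instance_ids_file` would feed `include_instance_ids` with one ID per line.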
Use `model_map` in the config if an agent expects a different CLI model ID than the
canonical benchmark model name.
`plan` and `run` now perform strict agent/model normalization checks and fail
fast on mismatches (for example, Cursor with `claude-opus-4-6` will error and suggest
`opus-4.6`).
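That fail-fast check could be sketched with a close-match suggestion. The model table here is illustrative only, not the tool's real mapping, and `check_model` is a hypothetical helper:

```python
import difflib

# Hypothetical per-agent table of accepted CLI model IDs.
KNOWN_MODELS = {"cursor": ["opus-4.6", "sonnet-4.5"]}

def check_model(agent: str, model: str) -> None:
    known = KNOWN_MODELS.get(agent, [])
    if model in known:
        return
    # Suggest the closest accepted ID, if any is similar enough.
    suggestion = difflib.get_close_matches(model, known, n=1)
    hint = f"; did you mean {suggestion[0]!r}?" if suggestion else ""
    raise ValueError(f"unknown model {model!r} for agent {agent!r}{hint}")
```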
Run the full Tokio Phase 1 flow in one command:

```bash
./scripts/swebench/phase1_tokio_run.sh
```

A Streamlit app for browsing bug-report datasets (JSONL/JSON) with a GitHub-style diff viewer for patches.
- Install viewer dependencies:

  ```bash
  pip install -r requirements-dataset-viewer.txt
  ```

  Or, with the project installed: `pip install -e '.[viewer]'`.

- Run the app:

  ```bash
  streamlit run app.py
  ```

- In the sidebar, set the dataset path (default:
  `datasets/swebench_multilingual.test.tokio.jsonl`), filter by repo, search problem statements, or jump to a record by `instance_id`. The main panel shows metadata, the problem statement, hints, and side-by-side patch and test-patch diffs (rendered with diff2html).
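Under the hood, a JSONL dataset like the viewer's default is just one JSON object per line, so reading it and jumping to a record takes a few lines. A sketch independent of the app's actual code:

```python
import io
import json

def load_jsonl(fp):
    # One JSON object per non-empty line, as in the exported dataset files.
    return [json.loads(line) for line in fp if line.strip()]

def find_by_instance_id(records, instance_id):
    return next((r for r in records if r.get("instance_id") == instance_id), None)

# Tiny in-memory example in place of a real dataset file.
sample = io.StringIO('{"instance_id": "x-1", "repo": "tokio-rs/tokio"}\n')
records = load_jsonl(sample)
```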