High-performance SWE-bench dataset generator and evaluation harness that mines real GitHub pull requests, produces evaluation-ready task instances, and benchmarks coding agents.
Built on top of SweInfinite by @unconst, rewritten in Python with:
- Agentic command discovery (NO hardcoded install commands)
- Language detection (rule-based, OK to hardcode)
- Difficulty filtering with LLM classification
- Docker verification of generated tests
- Full parallelism with semaphore-based concurrency
- Structured LLM outputs via OpenAI function calling
- 200k context auto-compaction with smart summarization
swe-forge connects to GH Archive to discover recently merged pull requests, enriches them via the GitHub API, classifies their difficulty using an LLM, agentically discovers install/test commands, generates test specifications via an agentic loop, and exports SWE-bench-compatible task instances.
| Feature | Description |
|---|---|
| 🔍 Real GitHub Data | Mines GH Archive for merged PRs across all public repositories |
| 🎯 Difficulty Filtering | Pre-classifies PRs as easy/medium/hard before expensive processing |
| 🤖 Agentic Discovery | Discovers install/test commands from CI/CD (NO hardcoding) |
| 📦 Docker Verification | Verifies tests in Docker before export |
| ⚡ Full Parallelism | GH Archive 8x, enrichment 20x, Docker 8x concurrent |
| 🧠 Smart Compaction | 200k context limit with structured summary templates |
| 📊 Complete Export | workspace.yaml + patch.diff + tests/ directory |
```shell
pip install swe-forge
```

Or install from source:

```shell
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e .
```

Or pull the Docker image:

```shell
docker pull ghcr.io/cortexlm/swe-forge:latest
```

```shell
# Required environment variables
export GITHUB_TOKEN="ghp_..."             # GitHub PAT for PR enrichment
export OPENROUTER_API_KEY="sk-or-v1-..."  # OpenRouter API key for LLM
```

```shell
# Mine 10 tasks with workspace export
swe-forge mine mine \
  --limit 10 \
  --output ./tasks.jsonl \
  --output-folder ./tasks \
  --docker-username myuser \
  --parallel 8

# Mine with difficulty filter
swe-forge mine mine \
  --limit 5 \
  --difficulty hard \
  --min-stars 100

# Mine specific repository
swe-forge mine mine \
  --repo python/cpython \
  --limit 3
```

```shell
# Full A-Z pipeline with test verification
swe-forge mine complete \
  --repo owner/repo \
  --pr 12345 \
  --output ./tasks.jsonl \
  --model openai/gpt-5.4
```

```
tasks/
├── owner-repo-1234/
│   ├── workspace.yaml       # Complete task configuration
│   ├── patch.diff           # PR patch to apply
│   ├── test_patch.diff      # Test file changes
│   └── tests/               # Extracted test files
│       ├── test_feature.py
│       └── test_another.py
└── owner-repo-5678/
    └── ...
```
```yaml
task_id: owner-repo-1234
repo:
  url: https://github.com/owner/repo.git
  base_commit: abc123def456...
  merge_commit: fed456abc123...
language: python
difficulty_score: 5
prompt: "Fix the bug in..."
environment:
  image: myuser/swe-forge-tasks:owner-repo-1234
  language_version: "3.12"
install:
  commands:
    - pip install -e .
    - pip install pytest
tests:
  fail_to_pass:
    - pytest tests/test_feature.py -v
    - pytest tests/test_another.py::test_case -v
  pass_to_pass:
    - pytest tests/ -v --ignore=tests/test_feature.py
docker:
  image: myuser/swe-forge-tasks:owner-repo-1234
  build: true
```

```shell
swe-forge mine mine [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--repo` | `-r` | All | Target repository (owner/repo format) |
| `--limit` | `-l` | 10 | Maximum tasks to mine |
| `--output` | `-o` | `./tasks.jsonl` | Output JSONL file |
| `--output-folder` | `-O` | None | Output folder for workspace format |
| `--docker-username` | `-D` | None | Docker Hub username for image names |
| `--parallel` | `-p` | 8 | Concurrent Docker containers |
| `--difficulty` | `-d` | All | Filter: easy, medium, hard |
| `--model` | `-m` | `moonshotai/kimi-k2.5` | LLM model for classification |
| `--min-stars` | | 100 | Minimum repository stars |
| `--language` | | python | Filter by language |
| `--filter` | `-f` | `{"easy":10,"medium":10,"hard":10}` | JSON max tasks per difficulty |
| `--verbose` | `-v` | False | Enable verbose logging |
```shell
swe-forge mine complete [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--repo` | `-r` | Required | Target repository (owner/repo) |
| `--pr` | `-p` | Required | Pull request number |
| `--output` | `-o` | `./tasks.jsonl` | Output file |
| `--model` | `-m` | `openai/gpt-5.4` | LLM model |
| `--verbose` | `-v` | False | Verbose logging |
```mermaid
sequenceDiagram
    participant GHA as GH Archive
    participant SF as swe-forge
    participant GH as GitHub API
    participant LLM as LLM
    participant D as Docker
    GHA->>SF: Merged PR events (8x concurrent)
    SF->>SF: Pre-filter (bots, org, stars)
    SF->>GH: Enrich candidates (20x concurrent)
    GH-->>SF: PR metadata + diff
    SF->>LLM: Classify difficulty
    LLM-->>SF: easy / medium / hard
    SF->>D: Agentic discovery (8x concurrent)
    D-->>SF: fail_to_pass + pass_to_pass
    SF->>LLM: Quality scoring
    LLM-->>SF: Accept / reject
    SF-->>SF: Export workspace.yaml
```
| Stage | Semaphore | Default | Description |
|---|---|---|---|
| GH Archive Fetch | `gh_archive_sem` | 8 | Download hourly dumps |
| GitHub Enrichment | `enrichment_sem` | 20 | Fetch PR metadata (5000/h rate limit) |
| Pre-classification | `preclassify_sem` | 25 | LLM triage on title+body |
| Deep Processing | `deep_sem` | 8 | Full pipeline per candidate |
| Docker Containers | `docker_sem` | 8 | Concurrent test verification |
IMPORTANT: Commands are NEVER hardcoded.
```mermaid
sequenceDiagram
    participant AD as Agent Discovery
    participant CI as CI/CD Config
    participant LLM as LLM
    participant SH as Shell (Docker)
    AD->>CI: Parse .github/workflows/, .gitlab-ci.yml
    CI-->>AD: Install patterns, test commands
    AD->>SH: Clone repo in Docker
    AD->>LLM: "Discover how to install and test"
    loop Up to 200 turns
        LLM->>SH: shell("pip install -e .")
        SH-->>LLM: exit_code=0
        LLM->>SH: shell("pytest tests/")
        SH-->>LLM: exit_code=0, output
    end
    LLM->>AD: submit_tests(fail_to_pass, pass_to_pass)
```
- Clone repository at base commit
- Detect language from files (package.json, pyproject.toml, Cargo.toml, etc.)
- Discover commands by:
  - Parsing CI/CD workflows
  - Reading package manager configs
  - Trying commands and checking exit codes
- Generate tests via LLM agentic loop
- Verify tests fail before patch (proves bug exists)
- Apply patch
- Verify tests pass after patch (proves fix works)
| Level | Score Range | Typical Changes | Examples |
|---|---|---|---|
| Easy | 0.1 – 0.35 | Typos, config, single-file | Fix import, update version |
| Medium | 0.4 – 0.65 | Bug fixes, features, APIs | Fix race condition, add endpoint |
| Hard | 0.7 – 1.0 | Cross-cutting, architectural | New subsystem, migration |
- Pre-classification: `moonshotai/kimi-k2.5` (fast triage on title+body)
- Full classification: uses complete diff and test spec
When context exceeds 200k tokens, the system uses structured summarization:
```markdown
## Goal
[What goal(s) is the user trying to accomplish?]

## Instructions
- [What important instructions did the user give you]
- [If there is a plan or spec, include information about it]

## Discoveries
[What notable things were learned during this conversation]

## Accomplished
[What work has been completed, in progress, and left?]

## Relevant files / directories
[Structured list of relevant files]
```

This preserves critical context across long agentic sessions.
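A minimal sketch of how such a compaction trigger might work. The 4-characters-per-token estimate, the keep-last-10 split, and the `maybe_compact`/`summarize` names are illustrative assumptions, not swe-forge's actual implementation:

```python
CONTEXT_LIMIT = 200_000  # tokens

def estimate_tokens(messages: list[str]) -> int:
    # Rough heuristic: ~4 characters per token.
    return sum(len(m) for m in messages) // 4

def maybe_compact(messages: list[str], summarize) -> list[str]:
    """Replace older messages with a structured summary when over budget."""
    if estimate_tokens(messages) <= CONTEXT_LIMIT:
        return messages
    # Keep the most recent messages verbatim; summarize the rest
    # using the structured template above.
    head, tail = messages[:-10], messages[-10:]
    return [summarize(head)] + tail
```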
| Variable | Required | Description |
|---|---|---|
| `GITHUB_TOKEN` | Yes | GitHub PAT for PR enrichment |
| `OPENROUTER_API_KEY` | Yes | OpenRouter API key for LLM calls |
| `HF_TOKEN` | No | HuggingFace token for dataset upload |
| `RUST_LOG` | No | Log level: debug, info, warn, error |
| Language | Detection | Package Managers |
|---|---|---|
| Python | pyproject.toml, setup.py, requirements.txt | pip, poetry, uv |
| JavaScript/TypeScript | package.json | npm, yarn, pnpm |
| Rust | Cargo.toml | cargo |
| Go | go.mod | go mod |
| Java | pom.xml, build.gradle | maven, gradle |
```shell
# Clone and install dev dependencies
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

```shell
# Run all tests
pytest tests/ -v

# Run specific test module
pytest tests/test_swe/test_pipeline.py -v

# Run with coverage
pytest tests/ --cov=src/swe_forge --cov-report=html
```

```shell
# Format
ruff format src/

# Lint
ruff check src/

# Type check
pyright src/
```

Benchmark run with 100 candidate PRs from GH Archive:
| Stage | Count | % of previous stage |
|---|---|---|
| Raw GH Archive events (12h) | 1,752,426 | 100% |
| Merged PR events | 35,498 | 2.03% |
| After pre-filter | 1,394 | 3.93% |
| Enriched successfully | 21 | 1.51% |
| Tests generated | 11 | 52.38% |
| Quality passed | 8 | 72.73% |
| Metric | Value |
|---|---|
| Tasks per hour | 8 |
| Avg time per task | 450s |
| Docker parallelism | 8 containers |
```python
import asyncio

from swe_forge.swe.pipeline import SwePipeline, SwePipelineConfig
from swe_forge.export.workspace import export_tasks_to_workspace

async def main() -> None:
    # Configure pipeline
    config = SwePipelineConfig(
        max_candidates=50,
        max_tasks=10,
        min_stars=100,
        languages=["python"],
    )

    # Run pipeline
    async with SwePipeline(config) as pipeline:
        result = await pipeline.run()

    # Export to workspace format
    export_tasks_to_workspace(
        result.tasks,
        output_folder="./tasks",
        docker_username="myuser",
    )

asyncio.run(main())
```

```python
from swe_forge.swe.models import SweTask

@dataclass
class SweTask:
    id: str
    repo: str                # owner/repo format
    base_commit: str         # Git SHA
    merge_commit: str        # Git SHA
    language: str            # python, rust, etc.
    difficulty_score: int    # 1-10
    patch: str               # Unified diff
    test_patch: str          # Test file changes
    fail_to_pass: list[str]  # Test commands
    pass_to_pass: list[str]  # Test commands
    install_config: dict     # Discovered install commands
    prompt: str              # Task description
    quality_score: float     # 0.0-1.0
    status: SweTaskStatus    # candidate, validated, etc.
```

The published dataset CortexLM/swe-forge on HuggingFace contains task instances with pre-built Docker images.
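The exported `tasks.jsonl` file is one JSON object per line, with fields mirroring the `SweTask` model. A minimal, hypothetical loader (the `load_tasks` helper is illustrative, not part of swe-forge's API):

```python
import json

def load_tasks(path: str) -> list[dict]:
    """Read a JSONL export into a list of task dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```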
Prerequisites:

- Docker installed and running
- `pip install datasets`

```shell
# Test a specific task by ID
python scripts/test_task.py --task-id pydantic-pydantic-12985

# Test 5 random tasks
python scripts/test_task.py --random 5

# Test all tasks and save results
python scripts/test_task.py --all --output results.json

# With verbose output
python scripts/test_task.py --task-id pydantic-pydantic-12985 -v
```

Or use the shell wrapper:

```shell
./scripts/test_task.sh --random 5
```

Each task is tested in an isolated Docker container:
- Pull Docker image: contains repo at `base_commit`
- Run `fail_to_pass` tests: should all PASS
- Run `pass_to_pass` tests: should all PASS

Pre-built Docker images (`platformnetwork/swe-forge:*`) contain:

- `/workspace/patch.diff`: the patch
- `/workspace/run_tests.sh`: test script
- Repository cloned at `base_commit`
| Field | Description |
|---|---|
| `instance_id` | Task ID (format: owner-repo-123) |
| `docker_image` | Pre-built Docker image |
| `fail_to_pass` | Tests that must pass after patch |
| `pass_to_pass` | Tests that must stay passing |
| `patch` | Unified diff to apply |
SWE-Forge provides a Docker-based evaluation harness for benchmarking model-generated patches, similar to SWE-bench.
```shell
pip install datasets  # For HuggingFace dataset loading
```

```shell
# Evaluate gold patches (ground truth) on a specific task
python3 scripts/run_evaluation.py --predictions_path gold --instance_ids pydantic-pydantic-12985

# Evaluate on 5 random tasks
python3 scripts/run_evaluation.py --predictions_path gold --random 5

# Evaluate all tasks
python3 scripts/run_evaluation.py --predictions_path gold --max_workers 8
```

Create a JSONL file with model predictions:

```jsonl
{"instance_id": "pydantic-pydantic-12985", "model_patch": "diff --git a/..."}
{"instance_id": "owner-repo-123", "model_patch": "..."}
```

Then evaluate:

```shell
python3 scripts/run_evaluation.py --predictions_path predictions.jsonl --max_workers 4
```

For each task, the harness:
- Pulls the Docker image: contains repo at base commit
- Runs `fail_to_pass` tests BEFORE the patch: should FAIL (bug exists)
- Applies the model patch
- Runs `fail_to_pass` tests AFTER the patch: should PASS (bug fixed)
- Runs `pass_to_pass` tests: should PASS (no regression)
- Grades: resolved if all tests pass as expected
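The grading rule above reduces to three boolean checks. A hedged sketch of that decision (the `grade` function and its argument shapes are illustrative, not the harness's actual interface):

```python
def grade(fail_to_pass_before: list[bool],
          fail_to_pass_after: list[bool],
          pass_to_pass_after: list[bool]) -> bool:
    """True ("resolved") iff every fail_to_pass test flips FAIL -> PASS
    and every pass_to_pass test keeps passing."""
    bug_reproduced = not any(fail_to_pass_before)  # all failed pre-patch
    bug_fixed = all(fail_to_pass_after)            # all pass post-patch
    no_regression = all(pass_to_pass_after)
    return bug_reproduced and bug_fixed and no_regression
```

Note that a patch which fixes the bug but breaks any `pass_to_pass` test is still graded unresolved.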
| Parameter | Description |
|---|---|
| `--predictions_path` | Path to JSONL or "gold" for ground truth |
| `--max_workers` | Parallel workers (default: 4) |
| `--instance_ids` | Specific instances to evaluate |
| `--random N` | Evaluate N random instances |
| `--timeout` | Timeout per instance (default: 600s) |
| `--run_id` | Run identifier |
| `--output_dir` | Output directory |
| `--clean` | Cleanup Docker after evaluation |
Results are saved to `evaluation_results/{run_id}/`:

- `results.json`: overall metrics
- `instance_results.jsonl`: detailed per-instance results
- Resolution Rate: Percentage of patches that fixed the issue
- Tests Passed/Failed: Test execution results
- Duration: Evaluation time
Built on top of SweInfinite by @unconst.
Extended with:
- Python rewrite with full async support
- Agentic command discovery (NO hardcoding)
- Docker verification of generated tests
- Structured workspace export
- 200k context auto-compaction
- Configurable parallelism
MIT — see LICENSE.
SWE-Forge includes a comprehensive quality control pipeline to ensure tasks are valid and appropriately challenging.
```
            Task Generation
                   ↓
┌─────────────────────────────────┐
│ 1. Complexity Evaluation        │
│    LLM assesses task difficulty │
│    Score: 0.0 (trivial) to 1.0  │
│    Reject if < 0.25             │
└─────────────────────────────────┘
                   ↓
┌─────────────────────────────────┐
│ 2. Docker Verification          │
│    Tests FAIL before patch      │
│    Apply patch                  │
│    Tests PASS after patch       │
│    Reject if tests don't work   │
└─────────────────────────────────┘
                   ↓
              Accept Task
```
The complexity evaluator uses an LLM agent to analyze:
| Factor | Impact |
|---|---|
| Lines changed | More lines → higher score |
| Files modified | More files → higher score |
| Logic complexity | Complex logic → higher score |
| Context needed | More context → higher score |
| Change type | Config/docs → lower score |
Scoring thresholds:
| Score | Difficulty | Action |
|---|---|---|
| 0.0-0.25 | Trivial | ❌ REJECTED |
| 0.25-0.40 | Easy | ✅ Accepted |
| 0.40-0.65 | Medium | ✅ Accepted |
| 0.65-1.00 | Hard | ✅ Accepted |
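The threshold table maps directly to a small decision function. A hedged sketch (the `classify` name is illustrative; the default 0.25 floor matches the `--min-complexity` default, which can raise it):

```python
def classify(score: float, min_complexity: float = 0.25) -> tuple[str, bool]:
    """Map a 0.0-1.0 complexity score to (difficulty, accepted)."""
    if score < min_complexity:
        return ("trivial", False)  # rejected
    if score < 0.40:
        return ("easy", True)
    if score < 0.65:
        return ("medium", True)
    return ("hard", True)
```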
Each task is verified in an isolated Docker container:
- Before patch: tests MUST FAIL (proves bug exists)
- Apply patch: `git apply /workspace/patch.diff`
- After patch: tests MUST PASS (proves fix works)
- Regression tests: `pass_to_pass` tests must stay passing
```shell
# Mining with quality control (default)
swe-forge mine mine --limit 100

# Adjust minimum complexity
swe-forge mine mine --min-complexity 0.30

# Skip Docker verification (faster, less reliable)
swe-forge mine mine --no-verify

# Skip complexity check (faster, accepts trivial tasks)
swe-forge mine mine --skip-complexity

# Use different model for evaluation
swe-forge mine mine --complexity-model openai/gpt-4
```

Revalidate existing tasks to filter out invalid ones:
```shell
# Revalidate all tasks
python scripts/revalidate_tasks.py --tasks-dir ./tasks

# Skip Docker verification (complexity only)
python scripts/revalidate_tasks.py --tasks-dir ./tasks --no-verification

# Limit to N tasks
python scripts/revalidate_tasks.py --tasks-dir ./tasks --limit 10

# Custom threshold
python scripts/revalidate_tasks.py --tasks-dir ./tasks --min-complexity 0.30

# Output report
python scripts/revalidate_tasks.py --tasks-dir ./tasks --report report.json
```

For a typical mining run:
| Metric | Typical Value |
|---|---|
| Tasks generated | 100% |
| Rejected (complexity) | ~20% |
| Rejected (verification) | ~20% |
| Accepted | ~60% |
An acceptance rate of 30-70% is normal and ensures quality benchmarks.
When tasks are exported to HuggingFace, quality fields are included:
| Field | Description |
|---|---|
| `complexity_score` | 0.0-1.0 complexity rating |
| `complexity_difficulty` | "easy", "medium", or "hard" |
| `verified` | True if Docker verification passed |
Filter on HF:
```python
from datasets import load_dataset

ds = load_dataset("CortexLM/swe-forge")

# Only medium+ difficulty, verified tasks
filtered = ds.filter(lambda x: x['complexity_score'] >= 0.4 and x['verified'])
```