Add MLE-Bench competition environment #1341

Open

poofeth wants to merge 5 commits into PrimeIntellect-ai:main from poofeth:bounty/mle-bench-env

Conversation

@poofeth poofeth commented May 11, 2026

Summary

Adds an installable mle-bench environment for the Prime MLE-Bench bounty.

This environment:

  • represents MLE-Bench competitions as v1 task rows
  • defaults to the low/lite split and includes the official low-complexity competition IDs
  • optionally enriches prompts with descriptions from the upstream mlebench registry when that package is installed
  • uses an OpenCode harness with sandboxed task metadata
  • follows the benchmark submission contract: agents must create /home/submission/submission.csv
  • validates submissions via /home/validate_submission.sh /home/submission/submission.csv when run in an MLE-Bench image with prepared data (see the sketch after this list)
  • avoids downloading Kaggle data or requiring credentials during import/local unit tests
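
A minimal sketch of that contract check, assuming a generic sandbox.exec helper. The helper name, the returncode attribute, and the reward shape are illustrative assumptions, not the PR's actual API:

# Hypothetical sketch; `sandbox.exec` and the reward shape are assumptions.
SUBMISSION_PATH = "/home/submission/submission.csv"
VALIDATE_SCRIPT = "/home/validate_submission.sh"

def submission_reward(sandbox) -> float:
    """Reward 1.0 only when the official validator accepts the CSV."""
    # Contract step 1: the agent must have written the CSV at the fixed path.
    if sandbox.exec(f"test -f {SUBMISSION_PATH}").returncode != 0:
        return 0.0
    # Contract step 2: the validator shipped in prepared MLE-Bench images
    # must accept the file.
    result = sandbox.exec(f"{VALIDATE_SCRIPT} {SUBMISSION_PATH}")
    return 1.0 if result.returncode == 0 else 0.0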

/claim https://algora.io/PrimeIntellect-ai/bounties/ZC7X91RCWXHy4BL2

Validation

uv run pytest tests/test_mle_bench_environment.py -q
# 6 passed

uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
print(type(loaded).__name__, loaded.taskset.taskset_id, len(list(loaded.taskset.source())))
PY
# Env mle-bench 1

uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed

uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted

git diff --check
# no output

Note

Medium Risk
Adds a new sandboxed evaluation environment that executes filesystem checks and a validation script inside the sandbox; correctness depends on runtime images and validator output parsing.

Overview
Adds a new installable mle-bench v1 environment that models each MLE-Bench competition as a sandboxed task: the agent must write /home/submission/submission.csv and earns reward only when /home/validate_submission.sh accepts it.

The taskset supports low/lite/dev/all splits, optionally enriches prompts from the upstream mlebench registry when available, and records validator stdout/stderr/exit code in rollout state. The PR also includes packaging (a pyproject.toml entry point), documentation, unit tests, and a README index update listing the new environment.
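
As a rough illustration of what recording the validator output in rollout state implies, here is a self-contained sketch; the key names and the dict shape are assumptions, not taken from the diff:

import subprocess

# Illustrative only: run the validator and keep its full output in rollout
# state; the key names are assumptions, not taken from the PR diff.
result = subprocess.run(
    ["/home/validate_submission.sh", "/home/submission/submission.csv"],
    capture_output=True,
    text=True,
)
rollout_state = {
    "validator_exit_code": result.returncode,
    "validator_stdout": result.stdout,
    "validator_stderr": result.stderr,
}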

Reviewed by Cursor Bugbot for commit e246cbb. Bugbot is set up for automated code reviews on this repo.

@poofeth poofeth commented May 11, 2026

Follow-up commit 8e220cf strengthens the MLE-Bench handoff:

  • added grading_submission_row(task) and grading_submission_jsonl(task) for the mlebench grade JSONL shape (sketched after this list)
  • added submission_nonempty and validator_available sandbox metrics
  • extended tests for grader handoff rows and the new sandbox metrics
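
A hedged reconstruction of those helpers; only competition_id is confirmed by the smoke output below, and the submission_path field is an assumption about the grade-row shape:

import json

def grading_submission_row(task: dict) -> dict:
    # Only `competition_id` is confirmed by the smoke output below;
    # `submission_path` is an assumed second field of the grade row.
    return {
        "competition_id": task["info"]["competition_id"],
        "submission_path": task["info"]["submission_path"],
    }

def grading_submission_jsonl(task: dict) -> str:
    # One JSON object per line, the shape mlebench-style graders consume.
    return json.dumps(grading_submission_row(task))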

Fresh validation:

uv run pytest tests/test_mle_bench_environment.py -q
# 8 passed

uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(type(loaded).__name__, loaded.taskset.taskset_id, mle_bench.grading_submission_row(row)['competition_id'])
PY
# Env mle-bench aerial-cactus-identification

uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed

uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted

git diff --check
# no output

@poofeth poofeth commented May 11, 2026

An additional package-install smoke test of the environment package metadata passed:

uv run --with ./environments/mle_bench python - <<'PY'
from mle_bench import load_environment, grading_submission_row
loaded = load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(loaded.taskset.taskset_id, grading_submission_row(row)['competition_id'])
PY
# mle-bench aerial-cactus-identification

@poofeth poofeth commented May 11, 2026

Addressed the two Cursor Bugbot findings in commit edc3966:

  • replaced substring matching on "valid" with exact validator success-line detection, so "invalid" can no longer be accepted accidentally (see the sketch after this list)
  • removed the unused DEFAULT_SUBMISSION_JSONL constant
  • added regression coverage for invalid validator output and exact success-line parsing
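
A minimal sketch of the bug and the fix; the exact success string below is an assumption (the real one is defined in the PR):

# Hypothetical success line; the real string is defined in the PR.
SUCCESS_LINE = "Submission is valid."

def is_valid_substring(output: str) -> bool:
    # Buggy: "invalid" contains "valid", so failures also match.
    return "valid" in output

def is_valid_exact(output: str) -> bool:
    # Fixed: require an exact match against a whole output line.
    return any(line.strip() == SUCCESS_LINE for line in output.splitlines())

assert is_valid_substring("Submission is invalid.")   # false positive
assert not is_valid_exact("Submission is invalid.")   # correctly rejected
assert is_valid_exact("Submission is valid.\n")       # accepted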

Fresh validation:

uv run pytest tests/test_mle_bench_environment.py -q
# 10 passed

uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(type(loaded).__name__, loaded.taskset.taskset_id, mle_bench.grading_submission_row(row)['competition_id'])
PY
# Env mle-bench aerial-cactus-identification

uv run --with ./environments/mle_bench python - <<'PY'
from mle_bench import load_environment, grading_submission_row
loaded = load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(loaded.taskset.taskset_id, grading_submission_row(row)['competition_id'])
PY
# mle-bench aerial-cactus-identification

uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed

uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted

git diff --check
# no output

@poofeth poofeth commented May 11, 2026

Addressed the current-head Bugbot prompt-path finding in e246cbb.

Changes:

  • Removed default /home/... paths from the shared BENCHMARK_INSTRUCTIONS system prompt.
  • build_prompt now derives the competition instructions path, dataset directory, submission path, and validation command from the configured workdir, submission_path, and validate_script (sketched after this list).
  • Added test_mle_bench_prompt_uses_configured_paths to cover custom path configurations and assert the default paths are not leaked into the prompt.
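
A rough sketch of that derivation; the parameter names follow the PR description, but the template wording is invented for illustration:

from pathlib import PurePosixPath

# Illustrative only: derive every path in the prompt from configuration so
# no /home/... default can leak. Template wording is invented.
def build_prompt(workdir: str, submission_path: str, validate_script: str) -> str:
    wd = PurePosixPath(workdir)
    return (
        f"Competition files live under {wd}. "
        f"Write your predictions to {submission_path}. "
        f"Validate with: {validate_script} {submission_path}"
    )

# With non-default paths, no /home/... literal appears in the prompt:
print(build_prompt("/workspace", "/workspace/out/submission.csv",
                   "/workspace/validate.sh"))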

Validation after the fix:

  • uv run pytest tests/test_mle_bench_environment.py -q -> 11 passed
  • uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py -> passed
  • uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py -> passed
  • uv run --with ./environments/mle_bench python ... -> mle-bench aerial-cactus-identification
  • git diff --check -> passed

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e246cbb.

@poofeth poofeth commented May 11, 2026

Addressed the remaining current-head Bugbot workdir finding in d2e6b9e.

Changes:

  • Added workdir to each MLE-Bench task record under task["info"].
  • valid_submission now passes the configured workdir as the validator command's working directory instead of hardcoding /home (see the sketch after this list).
  • Added test_mle_bench_valid_submission_uses_configured_workdir to cover non-default workdir, submission path, and validator path together.
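
A minimal sketch of the change, using subprocess locally as a stand-in for the sandbox executor; all names here are illustrative:

import subprocess

# Illustrative stand-in for the sandbox executor: the validator runs with
# the configured workdir as cwd instead of a hardcoded "/home".
def valid_submission(workdir: str, validate_script: str, submission_path: str) -> bool:
    result = subprocess.run(
        [validate_script, submission_path],
        cwd=workdir,  # configured per task, no longer hardcoded
        capture_output=True,
        text=True,
    )
    return result.returncode == 0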

Validation after the fix:

  • uv run pytest tests/test_mle_bench_environment.py -q -> 12 passed
  • uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py -> passed
  • uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py -> passed
  • PYTHONPATH=environments/mle_bench uv run python ... -> local import smoke loaded environments/mle_bench/mle_bench.py and printed mle-bench aerial-cactus-identification /home
  • git diff --check -> passed

Note: uv run --with ./environments/mle_bench ... reused a cached archive of the same local package name during this session, so the final smoke used PYTHONPATH instead to exercise the current checkout contents directly (an approximate reconstruction follows).
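
An approximate reconstruction of that PYTHONPATH smoke; the original inline script was elided above, so the exact statements are assumptions consistent with the reported output:

PYTHONPATH=environments/mle_bench uv run python - <<'PY'
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(mle_bench.__file__)  # should point at environments/mle_bench/mle_bench.py
print(loaded.taskset.taskset_id,
      mle_bench.grading_submission_row(row)["competition_id"],
      row["info"]["workdir"])
PY
# .../environments/mle_bench/mle_bench.py
# mle-bench aerial-cactus-identification /home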
