A harness for evaluating ML agents on machine learning error-analysis tasks.
This harness enables testing AI agents on their ability to:
- Diagnose ML-specific errors in Python code (data leakage, scaling errors, encoding issues, etc.)
- Localize issues to specific code locations
- Propose and implement fixes
- Verify their fixes work correctly
```
mle-reasoning-environment/
├── Dockerfile            # Container definition
├── requirements.txt      # Python dependencies
├── files/                # Static files mounted into the container
├── tasks/                # Task JSONs
└── tools/
    ├── tools.py          # Core tools (file I/O, Python exec, bash)
    ├── agent.py          # Agent loop with tool calling
    ├── llm.py            # LiteLLM abstraction for multiple providers
    ├── evaluator.py      # Rubric-based scoring
    ├── run_agent.py      # CLI for running single/batch tasks
    ├── agent_server.py   # Integration wrapper
    ├── mcp_server.py     # MCP server exposing tools
    └── test_harness.py   # Comprehensive test suite
```
| Tool | Description |
|---|---|
| `read_file` | Read the contents of a file |
| `write_file` | Write content to a file (creates directories as needed) |
| `list_files` | List directory contents |
| `run_python` | Execute a Python script with a timeout |
| `bash_exec` | Execute bash commands with a timeout |
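As a rough illustration, the tools behave like plain functions over the task workspace. The snippet below is a hypothetical sketch: the function names come from the table above, but the argument names and return values are assumptions, not the exact signatures in `tools/tools.py`.

```python
# Hypothetical usage sketch -- argument names and return shapes are assumptions,
# not the exact signatures defined in tools/tools.py.
from tools import write_file, read_file, run_python

write_file("src/fix.py", "print('scaler fit on train split only')")
print(read_file("src/fix.py"))

# Execute the script with a timeout (seconds); assumed to return captured output.
result = run_python("src/fix.py", timeout=60)
print(result)
```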
Tasks are in JSON format:

```json
{
  "task_id": "error-analysis-1",
  "prompt": [{"type": "text", "content": "...task description..."}],
  "rubrics": [
    {
      "name": "IDENTIFY_ISSUE_TYPE",
      "weight": 5,
      "score": {"type": "discrete", "outcomes": [...]},
      "messages": [{"type": "text", "content": "...criterion..."}]
    }
  ],
  "rubric_text": "+5|IDENTIFY_ISSUE_TYPE|...",
  "use_docker": true,
  "task_dir": "/app",
  "task_files": {"src/code.py": "...", "data/train.csv": "..."},
  "max_turns": 50
}
```
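For quick inspection, a task file can be loaded with plain `json`. The sketch below relies only on the fields shown above; the file path is illustrative.

```python
import json

# Load a task definition and summarize its rubric weights (path is illustrative).
with open("tasks/error-analysis-1.json") as f:
    task = json.load(f)

print(task["task_id"], "max_turns:", task["max_turns"])
print("rubric criteria:", [r["name"] for r in task["rubrics"]])
print("total rubric weight:", sum(r["weight"] for r in task["rubrics"]))
```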
To run the harness you will need:

- Docker installed and running
- API keys for your LLM provider:

```bash
export OPENAI_API_KEY="your-key"    # or: export ANTHROPIC_API_KEY="your-key"
```
Build the Docker image:

```bash
cd mle-reasoning-environment
docker build -t mle-reasoning-environment .
```

One example task is included in this repo for a sample run (`error-analysis-1.json`). Copy additional task files into `tasks/`:
```bash
cp -r path/to/your/task/dir/* mle-reasoning-environment/tasks/
```

Run a single task:

```bash
docker run --rm \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -v $(pwd)/tasks:/workspace/tasks \
  -v $(pwd)/results:/workspace/results \
  mle-reasoning-environment \
  python run_agent.py --task /workspace/tasks/task_error-analysis-1-dev.json
```

Run all tasks in a directory:

```bash
docker run --rm \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -v $(pwd)/tasks:/workspace/tasks \
  -v $(pwd)/results:/workspace/results \
  mle-reasoning-environment \
  python run_agent.py --tasks-dir /workspace/tasks --output-dir /workspace/results
```

To run locally without Docker:

```bash
cd tools
pip install -r ../requirements.txt
python run_agent.py --task ../tasks/task_error-analysis-1-dev.json --model openai/gpt-4o-mini
```

Run the comprehensive test suite:
```bash
cd tools
python test_harness.py
```

Tests cover:
- All file operations (read, write, list)
- Python script execution with timeout
- Bash command execution
- Tool JSON schema generation
- End-to-end ML workflow simulation
| Code | Description |
|---|---|
| `DATA_LEAKAGE` | Model sees validation/test data during training |
| `OUTLIER` | Extreme samples not handled properly |
| `SCALING_ERROR` | Features scaled with the wrong statistics |
| `ENCODING_ERROR` | Categorical variables encoded incorrectly |
| `IMBALANCE` | Class distribution skewed and not handled |
| `OVERFITTING` | Good train performance, poor validation performance |
| `UNDERFITTING` | Poor performance on both splits |
| `NONE` | No ML-specific errors present |
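For example, `DATA_LEAKAGE` and `SCALING_ERROR` often appear together when preprocessing statistics are computed on the full dataset. The snippet below is an illustrative bug/fix pair, not code taken from the repo's task files:

```python
# Illustrative example (not from the task files): fitting the scaler on the full
# dataset leaks validation statistics into training -- DATA_LEAKAGE / SCALING_ERROR.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)

# Buggy: the scaler sees the validation rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, random_state=0)

# Fixed: fit on the training split only, then transform both splits.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)
```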
The evaluator uses rubric-based scoring:
- File checks: Verify required output files exist
- Content analysis: LLM-judged criteria for report quality
- Code verification: Check that fixes are implemented correctly
Scoring breakdown for a typical task (90 total points):
- Issue identification: 15 points
- Issue localization: 20 points
- Explanation quality: 5 points
- Fix proposal: 20 points
- Fix implementation: 12 points
- Format requirements: 8 points
- Output file checks: 13 points
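A minimal sketch of how weighted rubric results roll up into the reported score. The actual logic lives in `evaluator.py`; the per-rubric `passed` field and the example rubric names here are assumptions for illustration.

```python
# Minimal aggregation sketch -- evaluator.py is the source of truth; the
# per-rubric "passed" field is an assumption for illustration.
def aggregate(rubric_results):
    total_possible = sum(r["weight"] for r in rubric_results)
    score = sum(r["weight"] for r in rubric_results if r.get("passed"))
    return {
        "score": score,
        "total_possible": total_possible,
        "percentage": round(100 * score / total_possible, 1),
    }

print(aggregate([
    {"name": "IDENTIFY_ISSUE_TYPE", "weight": 5, "passed": True},
    {"name": "PROPOSE_FIX", "weight": 20, "passed": False},
]))
# -> {'score': 5, 'total_possible': 25, 'percentage': 20.0}
```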
The harness supports multiple LLM providers via LiteLLM:
| Provider | Model Format |
|---|---|
| OpenAI | openai/gpt-4o-mini, openai/gpt-4o |
| Anthropic | anthropic/claude-3-5-sonnet-latest |
| Google | google/gemini-1.5-pro |
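Under the hood, `llm.py` builds on LiteLLM's provider-prefixed model strings. The sketch below shows the underlying LiteLLM call, not the repo's exact wrapper code; the prompt and model string are illustrative.

```python
# Minimal LiteLLM sketch -- llm.py wraps a call like this.
# Requires the matching API key (e.g., OPENAI_API_KEY) in the environment.
import litellm

response = litellm.completion(
    model="openai/gpt-4o-mini",  # provider/model format from the table above
    messages=[{"role": "user", "content": "Name one common cause of data leakage."}],
)
print(response.choices[0].message.content)
```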
For MCP-compatible clients:
```bash
python mcp_server.py
```

This exposes all tools via the Model Context Protocol.
Agent results are saved as JSON:

```json
{
  "task_id": "error-analysis-1-dev",
  "model": "openai/gpt-4o-mini",
  "answer": "FINAL ANSWER: ...",
  "metadata": {
    "turns": 12,
    "tool_calls": 25,
    "errors": []
  },
  "evaluation": {
    "score": 75,
    "total_possible": 90,
    "percentage": 83.3,
    "results": [...]
  }
}
```
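Result files can be summarized with a short script. The sketch below relies only on the fields shown above and assumes one JSON file per task in the `results/` directory used in the Docker commands earlier.

```python
# Summarize saved result files -- relies only on the fields shown above.
import json
from pathlib import Path

for path in sorted(Path("results").glob("*.json")):
    r = json.loads(path.read_text())
    ev = r.get("evaluation", {})
    print(f"{r['task_id']:<30} {ev.get('score', 0)}/{ev.get('total_possible', 0)} "
          f"({ev.get('percentage', 0)}%) in {r['metadata']['turns']} turns")
```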
If Docker is not running (macOS), start Docker Desktop:

```bash
open -a Docker
```

If your API key is not set:

```bash
# Check if set
echo $OPENAI_API_KEY

# Set if missing
export OPENAI_API_KEY="sk-..."
```

The harness auto-detects `python` or `python3`, so script execution should work without additional configuration.
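The detection amounts to checking what is on the PATH; a one-liner sketch of the idea (illustrative, not the harness's exact code):

```python
# Illustrative sketch of the interpreter auto-detection idea.
import shutil

python_cmd = shutil.which("python") or shutil.which("python3")
print("Using interpreter:", python_cmd)
```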