DataSciBench: An LLM Agent Benchmark for Data Science

📃 [DataSciBench] [GitHub] [Evaluation Data] [Website]

Fork Changes

This is a fork of THUDM/DataSciBench with significant optimizations for faster benchmark execution and improved reliability. Key changes include:

3x faster runs — reduced the number of runs per task from 10 to 3
Early stopping — agents are limited to 5 attempts per sub-task to prevent infinite retry loops
Repetition loop detection — custom detector that aborts LLM generation when output starts looping
Forced streaming — all LLM calls now use streaming mode to enable real-time repetition detection
API key rotation — automatic round-robin across multiple API keys
Bug fixes — corrected metric calculations, path handling, regex escaping, and matplotlib backend issues

For the full list of changes and benchmark results, see CHANGELOG.md.

Benchmark Generator

We added a benchmark_generator module for automatic generation of new benchmark tasks from an arbitrary codebase using LLM. The pipeline includes prompt generation, automated solving, human review, and packing into the benchmark format. See benchmark_generator/README.md for details.

Introduction

Install MetaGPT

cd MetaGPT
pip install .

Data

All collected prompts have been processed through notebooks/preprocess_prompt.ipynb and save into data/{task_id}/.

Prepare Env

Run

pip install -r requirements.txt

To run dl experiments, you may also need to log in to wandb.

Experimental Setting

Config model in ~/.metagpt/config2.yaml/

Running

Run all experiments as follows

python -c "import sys; sys.path.append('/path/to/DataSciBench'); from experiments import run_examples"

python -m experiments.run_examples

Run a particular experiment, as follows

python -m experiments.run_examples --task_id dl_0

python -m experiments.run_examples --data_type dl --config test_config_dl.yaml

Specifies to run a prompt of some kind, for example, to run a prompt with no external data dependencies

python -m experiments.run_examples --data_source_type 1

Check the generation process

python -m evaluations.check_result --model_id all

Output Sample

{
    "time_cost": 146.17888736724854,
    "error_list": [
        1,
        2,
        2,
        2,
        2,
        2,
        0
    ],
    "cost": [
        37966,
        4465,
        0.04689600000000001,
        0
    ],
    "plan": [
        {
            "task_id": "1",
            "dependent_task_ids": [],
            "instruction": "Fine-tune the sentiment classification model using the EleutherAI/twitter-sentiment dataset",
            "task_type": "predictive modeling",
            "code": "tokenizer = GPT2Tokenizer.from_pretrained('../gpt2-small/')",
            "result": "",
            "is_success": true,
            "is_finished": true
        },
        {
            "task_id": "2",
            "dependent_task_ids": [
                "1"
            ],
            "instruction": "Verify the existence of the 'evaluation_data.parquet' file at the specified location '../' and update the file path if necessary",
            "task_type": "data exploration",
            "code": "import pandas as pd\n\n# Verify the existence of the evaluation data file\ntry:\n    evaluation_data = pd.read_parquet('../evaluation_data.parquet')\n\n    # Display basic statistical indicators\n    print(evaluation_data.describe())\n\n    # Visualize data distributions\n    import matplotlib.pyplot as plt\n    import seaborn as sns\n\n    # Histogram\n    plt.figure(figsize=(10, 6))\n    sns.histplot(evaluation_data['sentiment'], kde=False)\n    plt.title('Sentiment Distribution')\n    plt.xlabel('Sentiment')\n    plt.ylabel('Count')\n    plt.show()\n\n    # Box plot\n    plt.figure(figsize=(10, 6))\n    sns.boxplot(x=evaluation_data['sentiment'])\n    plt.title('Sentiment Distribution')\n    plt.xlabel('Sentiment')\n    plt.show()\n\n    # Correlation matrix\n    correlation_matrix = evaluation_data.corr()\n    plt.figure(figsize=(10, 6))\n    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')\n    plt.title('Correlation Matrix')\n    plt.show()\nexcept FileNotFoundError as e:\n    print('File not found:', e)",
            "result": "File not found: [Errno 2] No such file or directory: '../evaluation_data.parquet'\n",
            "is_success": true,
            "is_finished": true
        },
        {
            "task_id": "3",
            "dependent_task_ids": [
                "2"
            ],
            "instruction": "Generate a comprehensive report on the model's performance in PDF format",
            "task_type": "other",
            "code": "from reportlab.lib import colors\nfrom reportlab.lib.pagesizes import letter\nfrom reportlab.platypus import SimpleDocTemplate, Table, TableStyle\n\n# Sample data for demonstration\ndata = [['Metric', 'Value'],\n        ['Accuracy', '0.85'],\n        ['Precision', '0.78'],\n        ['Recall', '0.92'],\n        ['F1 Score', '0.84']]\n\n# Create PDF\npdf_filename = './performance_report.pdf'\npdf = SimpleDocTemplate(pdf_filename, pagesize=letter)\ntable = Table(data)\n\n# Add style to the table\nstyle = TableStyle([('BACKGROUND', (0, 0), (-1, 0), colors.grey),\n                    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),\n                    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),\n                    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),\n                    ('BOTTOMPADDING', (0, 0), (-1, 0), 12),\n                    ('BACKGROUND', (0, 1), (-1, -1), colors.beige),\n                    ('GRID', (0, 0), (-1, -1), 1, colors.black)])\n\ntable.setStyle(style)\n\n# Build PDF\npdf.build([table])\n\nprint(f\"Performance report generated at: {pdf_filename}\")\n",
            "result": "Performance report generated at: ./performance_report.pdf\n",
            "is_success": true,
            "is_finished": true
        }
    ]
}

Evaluation

Evaluate functions of BigCodeBench

python -m experiments.evaluate_tmc --model_id glm-4.5-flash

Evaluate other tasks

Set ground truth

Download ground truth from https://huggingface.co/datasets/zd21/DataSciBench/tree/main and move it to data/task_id/gt

Run evaluation

python -m experiments.evaluate

python -m experiments.evaluate --task_id all --model_id glm-4.5-flash

Calculate metric

Calculate metrics for BigCodeBench

python -m evaluation_results.calculate_final_metric

Others

Modification of Data Interpreter (deprecated)

See SciDataInterpreter.update_results_for_eval at role/sci_data_interpreter. We can get the plans with codes and results from each step; the costs for each step per plan; and the number of errors.

See SciDataInterpreter.get_CR at role/sci_data_interpreter. We can get the Completion Rate for this question that just ran. (Ground Truth not incorporated, so max at 0.5)

Citation

If you find our work helpful, please kindly cite our paper:

@article{zhang2025datascibench,
        title={DataSciBench: An LLM Agent Benchmark for Data Science},
        author={Zhang, Dan and Zhoubian, Sining and Cai, Min and Li, Fengzu and Yang, Lekang and Wang, Wei and Dong, Tianjiao and Hu, Ziniu and Tang, Jie and Yue, Yisong},
        journal={arXiv preprint arXiv:2502.13897},
        year={2025}
        }

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
benchmark_generator		benchmark_generator
data		data
evaluation_results		evaluation_results
evaluations		evaluations
experiments		experiments
metagpt		metagpt
metric		metric
notebooks		notebooks
role		role
scripts		scripts
sft		sft
src		src
tests		tests
utils		utils
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
README.md		README.md
config2.yaml		config2.yaml
find_used_metagpt.py		find_used_metagpt.py
pyproject.toml		pyproject.toml
qwen3_8b.yaml		qwen3_8b.yaml
rpd.md		rpd.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataSciBench: An LLM Agent Benchmark for Data Science

Fork Changes

Benchmark Generator

Introduction

Install MetaGPT

Data

Prepare Env

Experimental Setting

Running

Check the generation process

Output Sample

Evaluation

Evaluate functions of BigCodeBench

Evaluate other tasks

Set ground truth

Run evaluation

Calculate metric

Calculate metrics for BigCodeBench

Others

Modification of Data Interpreter (deprecated)

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DataSciBench: An LLM Agent Benchmark for Data Science

Fork Changes

Benchmark Generator

Introduction

Install MetaGPT

Data

Prepare Env

Experimental Setting

Running

Check the generation process

Output Sample

Evaluation

Evaluate functions of BigCodeBench

Evaluate other tasks

Set ground truth

Run evaluation

Calculate metric

Calculate metrics for BigCodeBench

Others

Modification of Data Interpreter (deprecated)

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages