Open-Source Benchmarks

Open-Source Benchmarks

Stealth Bench V1

71 tasks for evaluating browser stealth across anti-bot protections

Running the Stealth Benchmark

1. Install dependencies

pip install uv
uv sync

2. Set up your .env (see .env.example)

cp .env.example .env
# Fill in GOOGLE_API_KEY (required for the judge LLM)
# Fill in the API key for the browser provider you want to test

3. Decrypt the task set

python -c "
import base64, hashlib, json
from cryptography.fernet import Fernet
key = base64.urlsafe_b64encode(hashlib.sha256(b'Stealth_Bench_V1').digest())
tasks = json.loads(Fernet(key).decrypt(base64.b64decode(open('Stealth_Bench_V1.enc').read())))
print(f'Loaded {len(tasks)} tasks')
json.dump(tasks, open('Stealth_Bench_V1.json', 'w'), indent=2)
"

4. Run the evaluation

uv run python run_eval.py --browser <provider>

Available providers: browser-use-cloud, anchor, browserbase, browserless, hyperbrowser, onkernel, steel, local_headful, local_headless

Results and official data: stealth_bench/

BU Bench V1

100 hand-selected tasks for evaluating browser automation agents

Comparing Agent Frameworks

Comparing Models for Browser Use

Comparing Models for BrowserCode

Running BU Bench

1. Install dependencies

pip install uv
uv sync

2. Set up your .env (see .env.example)

cp .env.example .env
# Fill in BROWSER_USE_API_KEY (required for ChatBrowserUse and cloud browsers)
# Fill in GOOGLE_API_KEY (required for judge LLM)

3. Run evaluation

uv run python run_eval.py

Results are saved to results/ and detailed traces to run_data/.

Re-verifying Framework Results

Use run_framework_eval.py to rerun BU_Bench_V1 through a framework adapter. It decrypts BU_Bench_V1.enc in memory and writes local outputs to ignored results/ and run_data/.

uv run python run_framework_eval.py --list-frameworks
uv run python run_framework_eval.py --framework browser-use --browser browser-use-cloud --model bu-2-0

See the comment at the top of run_framework_eval.py for framework-specific setup, options, and examples.

Important: run_data/ traces include decrypted task text, ground truth, model outputs, and screenshots. They are gitignored for local verification only. Do not publish or commit them.

Swapping Models

Edit run_eval.py to change the model:

# Default: ChatBrowserUse (recommended)
agent = Agent(task=task["confirmed_task"], llm=ChatBrowserUse(), browser=browser)

# OpenAI
agent = Agent(task=task["confirmed_task"], llm=ChatOpenAI(model="gpt-4.1"), browser=browser)

# Anthropic
agent = Agent(task=task["confirmed_task"], llm=ChatAnthropic(model="claude-sonnet-4-5"), browser=browser)

# Google
agent = Agent(task=task["confirmed_task"], llm=ChatGoogle(model="gemini-2.5-flash"), browser=browser)

About BU Bench

100 tasks drawn from established benchmarks and custom challenges:

Source	Tasks	Description
Custom	20	Page interaction challenges
WebBench	20	Web browsing tasks
Mind2Web 2	20	Multi-step web navigation
GAIA	20	General AI assistant tasks (web-based)
BrowseComp	20	Browser comprehension tasks

WebBench, Mind2Web 2, and BrowseComp are released under the MIT license. GAIA has no explicit license; to comply with its data policies, we only include tasks from the "fully public" validation split, and all tasks are base64 encoded and encrypted to prevent data contamination.

Tasks were hand-selected for difficulty and verified to be achievable. Each task has been validated to confirm it can be completed successfully.

Important: The task set is stored in base64 encoding to prevent data contamination in LLM training. Please do not publish the tasks in plaintext or use them in model training data.

Task Format

Field	Description
`task_id`	Unique identifier
`confirmed_task`	Task instruction
`category`	Source benchmark
`answer`	Ground truth (if applicable)

Online-Mind2Web

The Online-Mind2Web benchmark is evaluated across agent frameworks.

Attributions

WebBench

MIT License | https://webbench.ai/

@misc{webbench2025,
  title = {WebBench: AI Web Browsing Agent Benchmark},
  author = {{Halluminate and Skyvern}},
  year = {2025},
  note = {\url{https://webbench.ai/}},
}

Mind2Web 2 (OMI2W-2)

MIT License | https://openreview.net/forum?id=AUaW6DS9si

@inproceedings{
    gou2025mind2web2,
    title={Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge},
    author={Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Botao Yu and Andrei Kopanev and Weijian Qi and Yiheng Shu and Jiaman Wu and Chan Hee Song and Bernal Jimenez Gutierrez and Yifei Li and Zeyi Liao and Hanane Nour Moussa and TIANSHU ZHANG and Jian Xie and Tianci Xue and Shijie Chen and Boyuan Zheng and Kai Zhang and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2025},
    url={https://openreview.net/forum?id=AUaW6DS9si}
}

BrowseComp

MIT License | https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf

@techreport{wei2025browsecomp,
  author = {Jason Wei and Zhiqing Sun and Spencer Papay and Scott McKinney and Jeffrey Han and Isa Fulford and Hyung Won Chung and Alex Tachard Passos and William Fedus and Amelia Glaese},
  title = {BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents},
  institution = {OpenAI},
  year = {2025},
  url = {https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf},
}

GAIA

No license (public validation split only) | https://huggingface.co/datasets/gaia-benchmark/GAIA

@misc{mialon2023gaia,
  title={GAIA: a benchmark for General AI Assistants},
  author={Gregoire Mialon and Clementine Fourrier and Craig Swift and Thomas Wolf and Yann LeCun and Thomas Scialom},
  year={2023},
  eprint={2311.12983},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
.github/workflows		.github/workflows
browsers		browsers
fonts		fonts
frameworks		frameworks
official_plots		official_plots
official_results		official_results
old_results		old_results
online-mind2web/official_plots		online-mind2web/official_plots
results		results
stealth_bench		stealth_bench
.env.example		.env.example
.gitignore		.gitignore
BU_Bench_V1.enc		BU_Bench_V1.enc
README.md		README.md
Stealth_Bench_V1.enc		Stealth_Bench_V1.enc
generate_plots.py		generate_plots.py
judge.py		judge.py
models.py		models.py
orchestrator.py		orchestrator.py
pyproject.toml		pyproject.toml
run_batch.py		run_batch.py
run_eval.py		run_eval.py
run_framework_eval.py		run_framework_eval.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Open-Source Benchmarks

Stealth Bench V1

Running the Stealth Benchmark

BU Bench V1

Comparing Agent Frameworks

Comparing Models for Browser Use

Comparing Models for BrowserCode

Running BU Bench

Re-verifying Framework Results

Swapping Models

About BU Bench

Task Format

Online-Mind2Web

Attributions

WebBench

Mind2Web 2 (OMI2W-2)

BrowseComp

GAIA

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Open-Source Benchmarks

Stealth Bench V1

Running the Stealth Benchmark

BU Bench V1

Comparing Agent Frameworks

Comparing Models for Browser Use

Comparing Models for BrowserCode

Running BU Bench

Re-verifying Framework Results

Swapping Models

About BU Bench

Task Format

Online-Mind2Web

Attributions

WebBench

Mind2Web 2 (OMI2W-2)

BrowseComp

GAIA

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages