New: Check out our sister project HarnessBench — fixes the base model, varies the harness. Same scoring pipeline, orthogonal axis.

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh

_{Clone → configure → run. Root uv package. Docker-isolated harnesses.}

Can AI Agents Complete Everyday Online Tasks?

ClawBench is an open-source benchmark that evaluates AI browser agents on everyday online tasks — booking travel, ordering food, applying for jobs, managing email — across live websites. V1 lives in test-cases/v1/ with 153 tasks across 144 websites; V2 lives in test-cases/v2/ with 130 tasks. It measures end-to-end task success with a 5-layer recording pipeline and an agentic evaluator that compares each run against human references. Top score to date: 33.3%.

We asked frontier AI agents to do what people do every day --
order food, book travel, apply for jobs, write reviews, manage projects.
Even the best agent only completes about 1 in 3.

_{Built by NAIL Group · Sister project: HarnessBench · Runs on any Chrome.}

V1: 153 everyday tasks · V2: 130 tasks · 144 live websites · 15 life categories

中文

What are you looking for?

🏆 See scores Live leaderboard _{Pick a corpus (v1 / v2)}	🚀 Run it on your model Quick start ↓ _{pip install clawbench-eval}	📊 Browse 283 tasks Task explorer _{Search · filter · category}	📄 Read the paper arXiv:2604.08523 _{Methodology · evaluator · results}
🎬 Re-grade old runs V1 · V2 raw traces _{5 layers per (task × model)}	📦 Download the data `hf download NAIL-Group/ClawBench` _{Tasks · rubrics · metadata}	🌱 Add a task / model How to contribute _{YAML spec + rubric}	❓ Have a question FAQ · Discord _{Or open an issue}

News

[2026.05.20] — V2 is now the default corpus + lenient judge + 6 first-class harnesses. Details →
[2026.05.16] — Added Claw-Eval suite: 19 browser-research tasks with final-answer submission. Details →
[2026.05.12] — Canonical leaderboard moved to TIGER-Lab/ClawBench Gradio Space. Details →
[2026.05.11] — V2 leaderboard ships: top so far glm-5.1 / hermes at 18.5% reward / 48.5% intercepted. Details →
[2026.05.09] — Inline LLM judge added as second scoring stage; runs now auto-produce pass/fail. Details →
[2026.05.09] — clawbench-eval package published to PyPI for one-command install. Details →
[2026.05.09] — Released ClawBenchV1Trace: full 5-layer execution trace for every V1 run. Details →
[2026.04.25] — Added support for the hermes harness. Details →
[2026.04.18] — Added support for the browser-use harness. Details →
[2026.04.11] — Paper released on arXiv (2604.08523); #3 HuggingFace Paper of the Day. Details →

Live Websites · Isolated Containers · Request Interceptor · Five-Layer Recording

Datasets

ClawBench ships three Hugging Face datasets — task definitions plus full execution traces for V1 and V2. All open, downloadable in one command. The benchmark itself is also mirrored on TIGER-Lab for visibility.

Dataset	What's in it	Get it
NAIL-Group/ClawBench (also mirrored at TIGER-Lab/ClawBench)	Task definitions, rubrics, and metadata for V1 (153 tasks) and V2 (130 tasks) — what to attempt and how it's judged.	`hf download --repo-type dataset NAIL-Group/ClawBench`
NAIL-Group/ClawBenchV1Trace	One directory per V1 model run, each with `recording.mp4`, `requests.jsonl`, `actions.jsonl`, `agent-messages.jsonl`, `interception.json`, and `run-meta.json` — everything we used to score the run.	`hf download --repo-type dataset NAIL-Group/ClawBenchV1Trace`
NAIL-Group/ClawBenchV2Trace	Same 5-layer bundle for V2 model runs. Rolling — new models added as they're evaluated.	`hf download --repo-type dataset NAIL-Group/ClawBenchV2Trace`

The trace datasets are large; use hf download --include "<pattern>" to pull a single model or a single task.
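If you script partial downloads, one option is to assemble the same `hf download` invocation in Python. The sketch below only builds the argument list from the flags shown in this README (`--repo-type`, `--include`, `--local-dir`). The helper name `build_hf_download_cmd` and the example glob pattern are illustrative assumptions, not part of ClawBench; check the dataset's directory layout before relying on a pattern.

```python
import shlex


def build_hf_download_cmd(repo_id, include=None, local_dir=None):
    """Build an `hf download` argv list for a dataset repo (hypothetical helper)."""
    cmd = ["hf", "download", "--repo-type", "dataset", repo_id]
    if include:
        # Restrict the download to files matching a glob pattern.
        cmd += ["--include", include]
    if local_dir:
        cmd += ["--local-dir", local_dir]
    return cmd


# The pattern below is an assumed example, not a documented layout.
cmd = build_hf_download_cmd(
    "NAIL-Group/ClawBenchV1Trace", include="*/001-*/**", local_dir="./traces"
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with glob characters.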

🏆 Live leaderboard: claw-bench.com/leaderboard (V2 default, two-stage scoring — interception + LLM judge). Full scoring formula in eval/scoring.md. Add your run: PR to leaderboard/results.csv.

How It Works

   You pick a task            ClawBench spins up           Agent drives the         Interceptor captures
   from V1 or V2              an isolated Docker           browser: navigates,      every action across
   everyday scenarios         container + Chromium         fills forms, clicks      all 5 layers of data

   ┌──────────────┐           ┌──────────────┐           ┌──────────────┐           ┌──────────────┐
   │  "Book a pet │    ──►    │   Container  │    ──►    │   AI Agent   │    ──►    │   5 layers   │
   │   sitter on  │           │  + Chromium  │           │  browses the │           │  intercepted │
   │   Rover"     │           │  + Agent     │           │   live site  │           │  & recorded  │
   └──────────────┘           └──────────────┘           └──────────────┘           └──────────────┘

LLM Quick Start

Point your coding agent (Claude Code, Cursor, Copilot, etc.) at AGENTS.md and prompt away.

Human Quick Start

Install ClawBench from PyPI for normal use:

uv tool install clawbench-eval

You can also use pipx install clawbench-eval or python -m pip install clawbench-eval. The installed commands are still clawbench, clawbench-run, and clawbench-batch.

If you want finer-grained control or plan to contribute, clone the repo and run the root uv package entrypoint:

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh

Prerequisites: Python 3.11+, uv, and a container engine — Docker or Podman. ClawBench auto-detects whichever is installed; force one with export CONTAINER_ENGINE=docker or export CONTAINER_ENGINE=podman.
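For intuition, engine selection of this kind can be sketched in a few lines of Python. This is a minimal illustration, not ClawBench's actual driver code: the function name `pick_container_engine` is hypothetical, and the probe order (Docker first, then Podman) is an assumption.

```python
import os
import shutil


def pick_container_engine(env=None, which=shutil.which):
    """Return 'docker' or 'podman', honoring CONTAINER_ENGINE if it is set."""
    env = os.environ if env is None else env
    forced = env.get("CONTAINER_ENGINE")
    if forced:
        if forced not in ("docker", "podman"):
            raise ValueError(f"unsupported CONTAINER_ENGINE: {forced!r}")
        return forced
    # Assumed probe order; the real driver may prefer a different engine.
    for engine in ("docker", "podman"):
        if which(engine):
            return engine
    raise RuntimeError("no container engine found: install Docker or Podman")
```

The `env` and `which` parameters exist only so the logic can be exercised without touching the real environment.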

Install Docker or Podman (macOS / Linux / Windows)

macOS

# Option A — Docker Desktop (easiest, includes GUI)
brew install --cask docker
open -a Docker                 # launch and wait for the whale icon to settle

# Option B — Podman (rootless, no daemon, CLI only)
brew install podman
podman machine init            # one-time: downloads the Linux VM image
podman machine start           # must be running before any podman command

macOS Podman needs a VM. brew install podman alone is not enough: Podman on macOS runs containers inside a small Linux VM, so you must run podman machine init && podman machine start once after installing, or podman info will fail with Cannot connect to Podman.

Linux (Ubuntu / Debian)

# Option A — Podman (rootless by default, recommended)
sudo apt update && sudo apt install -y podman

# Option B — Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER  # log out / back in so your shell picks up the group

Rootful Docker ownership note: with classic sudo-docker, files extracted from containers land owned by root on the host. ClawBench's driver detects this after each run and chowns test-output/ back to your user automatically — but if you run other container tooling alongside, rootless Podman (or rootless Docker) avoids the issue entirely.

Windows

# Option A — Docker Desktop (WSL2 backend)
winget install Docker.DockerDesktop
# then launch Docker Desktop from the Start menu and wait for it to be ready

# Option B — Podman
winget install RedHat.Podman
podman machine init
podman machine start

Run the uv run … commands below from PowerShell, WSL2, or Git Bash. Like macOS, Windows Podman requires podman machine init && podman machine start before its first use.

1. Configure models — one-time setup.

If you installed from PyPI, run clawbench from the directory where you want results and editable config to live. On first launch it creates local templates under models/; use the TUI to add a model or edit the file directly:

clawbench
$EDITOR models/models.yaml

If you are working from a source checkout:

cp models/models.example.yaml models/models.yaml
$EDITOR models/models.yaml

PurelyMail credentials for disposable run emails are provided in the committed .env. You only need to edit .env if you want to use your own PurelyMail account or enable optional HuggingFace upload.

Note

First run builds a container image (Chromium + ffmpeg + noVNC + the selected agent harness dependencies). You'll see a live progress spinner with the current build step. Subsequent runs reuse the cached layers and finish in seconds.

2. Run your first task (pick one):

Tip

(a) Recommended → Interactive TUI with guided model + test case selection

clawbench         # PyPI install
uv run clawbench  # source checkout

If installed from PyPI, run clawbench directly. Needs an interactive terminal. For pipes / CI / non-TTY, use clawbench-run or clawbench-batch directly; from a source checkout, prefix commands with uv run.

(b) Run one specific task against a specific model:

uv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6

Once the container starts, the script prints a noVNC URL (e.g. http://localhost:6080/vnc.html). Open it in your browser to watch the agent operate in real time. If port 6080 is already in use, an alternative port is chosen automatically.

Results land in ./test-output/<model>/<harness>-<case>-<model>-<timestamp>/ with the full five-layer recording. The default harness is openclaw; pass --harness <name> to select another:

--harness opencode: opencode
--harness claude-code: Claude Code
--harness claude-code-chrome-extension: Claude Code + the Claude in Chrome extension (Microsoft Edge + local bridge, bypass stack so any LiteLLM-routed provider works)
--harness codex: OpenAI Codex CLI
--harness claw-code: claw-code
--harness browser-use: browser-use (Python framework, routed via LiteLLM)
--harness hermes: Hermes Agent with native browser tools attached to ClawBench Chrome via CDP
--harness pi: Pi with pinned pi-browser-harness browser tools attached to the same ClawBench Chrome CDP endpoint
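When post-processing runs, it can help to locate the newest output directory for a model. The following is a rough sketch based only on the path layout described above; `latest_run_dir` is a hypothetical helper, and how model names containing `/` map to directory names is not covered here.

```python
from pathlib import Path


def latest_run_dir(output_root, model, harness="openclaw"):
    """Return the newest run directory for a model/harness pair, or None."""
    model_dir = Path(output_root) / model
    if not model_dir.is_dir():
        return None
    # Note: a prefix match on "claude-code-" would also pick up
    # "claude-code-chrome-extension-..." runs; filter further if needed.
    runs = [
        p for p in model_dir.iterdir()
        if p.is_dir() and p.name.startswith(f"{harness}-")
    ]
    # Sort by modification time rather than parsing the timestamp suffix,
    # since the exact timestamp format is not specified here.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

For example, `latest_run_dir("./test-output", "claude-sonnet-4-6")` would return the most recently modified openclaw run for that model, if one exists.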

(c) Drive the browser yourself via noVNC — produces a human reference run:

uv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats --human

Open the noVNC URL the script prints, complete the task by hand, then close the tab. Port is auto-assigned if 6080 is busy.

(d) Pair with an external browser agent — run in Human mode, open the noVNC URL, and let an external browser agent control that browser session while ClawBench records and intercepts it.

Develop from source: clone + `./run.sh` for contributors

Prefer the repo checkout if you want to modify the driver, the bundled V1/V2 test cases, or the container build itself.

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench
cp models/models.example.yaml models/models.yaml   # edit: add your model API keys
# .env is already provided for PurelyMail; edit only for your own creds or HF upload
./run.sh                                           # interactive TUI
uv run clawbench-run \
  test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6   # single run
uv run clawbench-run \
  test-cases/v1/001-daily-life-food-uber-eats --human             # human mode

This path gives you live-reload on src/, src/clawbench/runtime/chrome-extension/, and all suites under test-cases/ — useful when iterating on the harness itself.

Reproduce the leaderboard

Our scores are stable: two independent runs of the same model under the same judge (deepseek/deepseek-v4-pro, lenient rubric) reproduce Intercepted and Reward within ±2 pp on the V2 130-task corpus.

There are two ways to verify this on your own machine.

Path A — Re-run the agent, then score

Confirms the full pipeline (your agent + our judge) lines up with our leaderboard row.

clawbench-batch --models deepseek/deepseek-v4-flash --cases-suite v2 \
  --all-cases --harness hermes --no-judge --output-dir ./my-run
clawbench-rescore ./my-run --judge-model deepseek-v4-pro --rubric both

Path B — Skip the run, re-judge our published traces

Confirms just the judge matches ours (cheap, no agent compute, useful for sanity-checking your judge config).

hf download --repo-type dataset NAIL-Group/ClawBenchV2Trace \
  --include "batch-aligned-*/deepseek-v4-flash-free/**" --local-dir ./reproduce
clawbench-rescore ./reproduce --judge-model deepseek-v4-pro --rubric both

One-shot equivalent of Path B for any model in the leaderboard:

clawbench-reproduce --model deepseek-v4-flash --tolerance 2.0

Pass criterion

For deepseek-v4-flash:free × hermes × v2, the published row is Intercepted 3.1% / Reward-lenient 2.3% / Reward-strict 0.0% (3 / 129). Path A or B counts as reproduced when all three metrics land within ±2 pp. Larger gaps usually mean a different judge model, a different rubric prompt, or harness configuration drift; diff your eval_results/<batch>/summary.json against the published row to localize the cause.
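The pass criterion is simple enough to check by hand, but a small script makes the comparison explicit. This is a sketch under stated assumptions: the metric key names are invented for illustration and are not the actual summary.json schema, and the measured values in the example are made up. Only the published row comes from the paragraph above.

```python
# Published row for deepseek-v4-flash:free x hermes x v2 (percent).
PUBLISHED = {"intercepted": 3.1, "reward_lenient": 2.3, "reward_strict": 0.0}


def is_reproduced(measured, published=PUBLISHED, tolerance_pp=2.0):
    """Return (ok, gaps): ok is True when every metric is within tolerance."""
    gaps = {k: round(measured[k] - published[k], 2) for k in published}
    ok = all(abs(gap) <= tolerance_pp for gap in gaps.values())
    return ok, gaps


# Hypothetical measured values from your own run.
ok, gaps = is_reproduced(
    {"intercepted": 4.6, "reward_lenient": 2.3, "reward_strict": 0.8}
)
print(ok, gaps)
```

The per-metric gaps show which stage drifted: a gap in Intercepted points at the agent or harness, while a gap only in Reward points at the judge configuration.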

ClawBench-Lite

New here? Run this first. test-cases/v1-lite/ is a 20-task curated subset of the V1 153-task corpus, selected for household-name sites, real-world relevance, difficulty, and category diversity. It matches the 20-tasks-per-source convention of browser-use/benchmark and gives you a credible signal at a fraction of the full-benchmark cost.

Tier distribution: flagship 9 / core 8 / wildcard 3 — spanning daily life (OpenTable, DoorDash, Instacart, TaskRabbit), entertainment (Eventbrite, Goodreads, Fandango), creation (Asana, Mailchimp, Squarespace), travel (Airbnb), education (LeetCode), dev-tech (GitHub), academia (Overleaf), personal management (1Password), and more. All Lite tasks are judged by eval/agentic_eval.md regardless of url_pattern shape.

The Lite suite is a first-class task directory: run it with --cases-suite v1-lite, or inspect the link-backed task files in test-cases/v1-lite/.

Demos

Each ClawBench run produces a full MP4 session recording. See the project page for V1 task recordings.

Example Walkthrough

Curious what one task actually looks like, start to finish? Here's task 001 end to end.

The task — from test-cases/v1/001-daily-life-food-uber-eats/task.json:

{
  "instruction": "On Uber Eats, order delivery: one Pad Thai, deliver to home address, note \"no peanuts\"",
  "time_limit": 30,
  "eval_schema": {
    "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
    "method": "POST"
  }
}

The agent gets this instruction verbatim, plus read-only access to /my-info/alex_green_personal_info.json (the dummy user's name, home address, phone, date of birth) and a disposable email account for any sign-in prompt. It has 30 minutes to trigger the final POST request; after that, the container is killed.

What the agent does (the happy path):

Navigates to ubereats.com
Reads the dummy user's home address from /my-info/alex_green_personal_info.json and enters it in the delivery-address box
Searches for "Pad Thai" in the food search
Picks a restaurant that has Pad Thai available for delivery to that address
Opens the item detail page, finds the customization or special-instructions field, enters "no peanuts"
Adds one to cart, opens the cart, and handles any sign-in prompt using the disposable email credentials
Reaches checkout, taps Place Order

What the interceptor catches — that final Place Order tap fires a POST request. ClawBench's request interceptor sits in front of the browser and captures the outbound request before it reaches Uber Eats's servers, so the dummy user is never actually charged. At the exact moment of interception, all five recording layers (MP4 video, PNG screenshots, HTTP traffic, browser actions, agent messages) are frozen into /data/.

How the judge decides PASS / FAIL — task 001's url_pattern is the intentional sentinel __PLACEHOLDER_WILL_NOT_MATCH__, which means no request path can mechanically match. The verdict comes from the agentic judge in eval/agentic_eval.md, which replays the five-layer recording against a human reference run and checks four things:

Did the agent actually reach the final checkout step?
Is the cart exactly one Pad Thai (not two, not a combo)?
Is the delivery address the user's home address from alex_green_personal_info.json?
Does the order carry the "no peanuts" note in the instructions field?

All four must hold for a PASS. Miss any one and it's a FAIL with evidence from the recording pinned to the failing criterion. This per-task rubric is what makes ClawBench judge-sensitive rather than URL-regex-sensitive — see eval/README.md for the full rubric format and eval/agentic_eval.md for the judge prompt.
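The all-criteria-must-hold rule can be sketched as follows. This illustrates the aggregation logic only; the `Criterion` class, the `verdict` function, and the evidence strings are hypothetical and do not reflect the judge's real output format, which is documented in eval/README.md.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    passed: bool
    evidence: str  # illustrative pointer into the recording


def verdict(criteria):
    """PASS only if every criterion holds; otherwise FAIL with the failures."""
    failed = [c for c in criteria if not c.passed]
    return ("PASS" if not failed else "FAIL"), failed


checks = [
    Criterion("reached final checkout step", True, "recording.mp4"),
    Criterion("cart is exactly one Pad Thai", True, "screenshots/"),
    Criterion("delivery address is the home address", True, "requests.jsonl"),
    Criterion("order carries the 'no peanuts' note", False, "requests.jsonl"),
]
result, failed = verdict(checks)
print(result, [c.name for c in failed])
```

Returning the failed criteria alongside the verdict mirrors the idea of pinning evidence to the specific criterion that was missed.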

Results

ClawBench leaderboard · 6 tabs by corpus × harness · live at claw-bench.com

V2 (Hermes) · 8 models · ds-v4-pro judge, lenient + strict

Rank	Model	Harness	Intercepted	Reward (lenient)	Reward (strict)	Pass / Total
1	claude-opus-4-7	hermes	54.6%	44.6%	24.6%	58 / 130
2	gpt-5.5	hermes	45.4%	35.4%	18.5%	46 / 130
3	glm-5.1	hermes	48.5%	34.6%	17.7%	45 / 130
4	deepseek-v4-pro	hermes	43.9%	33.9%	12.3%	44 / 130
5	openrouter-owl-alpha	hermes	14.6%	0.0%	0.0%	0 / 130
6	z-ai/glm-4.5-air:free	hermes	4.6%	2.3%	0.8%	3 / 130
7	deepseek-v4-flash:free	hermes	3.1%	2.3%	0.0%	3 / 129
8	minimax-m2.5:free	hermes	2.3%	1.5%	0.0%	2 / 130

Intercepted = final HTTP request matched the task's URL/method (Stage 1, deterministic). Reward (lenient) = additionally judged by deepseek/deepseek-v4-pro to fulfill the instruction under the "no contradiction → match" rubric (Stage 2). Reward (strict) = same judge, strict rubric ("ambiguous → mismatch"). Ranked by Intercepted; Reward as tiebreak.
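As a rough illustration of how the three columns relate, the sketch below aggregates per-task outcomes into leaderboard-style percentages. The field names and the `summarize` function are assumptions for this example; the authoritative formula is in eval/scoring.md.

```python
def summarize(runs):
    """Compute leaderboard-style percentages from per-task outcomes.

    Each run is a dict with booleans 'intercepted', 'lenient', and 'strict'
    (field names are assumptions for this sketch).
    """
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")

    def pct(count):
        return round(100.0 * count / total, 1)

    intercepted = sum(r["intercepted"] for r in runs)
    # Stage 2 only counts when Stage 1 (interception) also succeeded.
    lenient = sum(r["intercepted"] and r["lenient"] for r in runs)
    strict = sum(r["intercepted"] and r["strict"] for r in runs)
    return {
        "intercepted": pct(intercepted),
        "reward_lenient": pct(lenient),
        "reward_strict": pct(strict),
        "pass_total": f"{lenient} / {total}",
    }
```

In the tables above, Pass / Total lines up with the lenient reward (for example, 58 / 130 is 44.6%), which is why the sketch reports the lenient count there.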

V2 (OpenClaw) · 1 model

Rank	Model	Harness	Intercepted	Reward (lenient)	Reward (strict)	Pass / Total
1	glm-5.1	openclaw	0.0%	0.0%	0.0%	0 / 130

V2 (Codex) · — (in progress)

In-flight: gpt-5.5-oauth, gpt-5.4-oauth, gpt-5.4-mini-oauth, gpt-5.3-codex-oauth, gpt-5.3-codex-spark-oauth, gpt-5.2-oauth. Will be filled in after judge_llm re-judge completes.

V2 (Claude Code) · — (not yet run)

—

V1 (Hermes) · 6 frontier models, original paper rubric

Rank	Model	Harness	Pass Rate	Pass / Total
1	claude-opus-4-6	hermes	61.4%	94 / 153
2	claude-sonnet-4-6	hermes	56.9%	87 / 153
3	claude-haiku-4-5-20251001	hermes	30.1%	46 / 153
4	gpt-5.4-2026-03-05	hermes	25.5%	39 / 153
5	gpt-5.4-mini-2026-03-17	hermes	24.8%	38 / 153
6	kimi-k2.5	hermes	17.6%	27 / 153

V1 Pass Rate is from the original paper rubric (Claude Code agentic-eval subagent comparing each run against human reference trajectories under eval/agentic_eval.md). The two-stage Reward (interception + deepseek/deepseek-v4-pro lenient judge) for V1 will appear here once V1 trace bundles are re-judged.

V1 per-category breakdown (Sonnet 4.6 vs 6-model comparison)

Rank	Model	Overall	Daily	Finance	Work	Dev	Academic	Travel	Social	Pets
1	Claude Sonnet 4.6	33.3	44.2	50.0	19.0	11.1	50.0	23.1	38.9	18.2
2	GLM-5	24.2	30.8	16.7	38.1	16.7	28.6	0.0	16.7	18.2
3	Gemini 3 Flash	19.0	15.4	33.3	23.8	22.2	28.6	30.8	11.1	0.0
4	Claude Haiku 4.5	18.3	15.4	22.2	19.0	27.8	21.4	7.7	16.7	18.2
5	GPT-5.4	6.5	9.6	0.0	0.0	11.1	7.1	7.7	0.0	9.1
6	Gemini 3.1 Flash Lite	3.3	1.9	0.0	0.0	5.6	14.3	0.0	0.0	9.1

V1 (OpenClaw) · — (not yet aggregated)

—

Task Categories (V1: 15 categories, 153 tasks)

Category	Tasks	Example Platforms
Daily Life	21	Uber Eats, DoorDash, Instacart, Zillow, Craigslist
Entertainment & Hobbies	15	Ticketmaster, AMC Theatres, Topgolf, Crunchyroll
Creation & Initialization	13	Squarespace, Wix, Webflow, Ghost, Substack
Rating & Voting	10	Trustpilot, G2, Goodreads, RateMyProfessors
Travel	9	Booking.com, Expedia, Airbnb, TripAdvisor
Education & Learning	9	Coursera, Udemy, Khan Academy, Duolingo
Office & Secretary	9	Google Calendar, Slack, Notion, Trello
Beauty & Personal Care	9	Sephora, Ulta, Glossier
Job Search & HR	8	LinkedIn, Greenhouse, Lever, Workday
Pet & Animal Care	8	Chewy, Petco, Rover
Personal Management	6	Mint, YNAB, Todoist
Shopping & Commerce	6	Amazon, eBay, Etsy, Target
Nonprofit & Charity	6	GoFundMe, DonorsChoose
Academia & Research	5	Google Scholar, Semantic Scholar, OpenReview
Finance & Investment	4	Robinhood, Fidelity, Coinbase
Others	15	Automation, Dev & Tech, Government, Home Services, Automotive

How ClawBench compares

Benchmark	Domain	Environment	Task count	ClawBench difference
WebArena	Synthetic web apps	Self-hosted replicas	812	Live consumer sites, not admin UIs on hosted replicas
GAIA	General assistants	Closed-book text + tools	466	Browser-centric; end-to-end task execution
SWE-bench	Software engineering	GitHub repos	2,294	Non-code; everyday consumer workflows
BrowserGym	Web agents	Headless sandbox	—	Cloud-parity; records real user journeys
Mind2Web	Web navigation	Static traces	2,350	Dynamic live websites, not replayed traces
Online-Mind2Web	Live web navigation	Real websites	300	Comparable task count (V1+V2: 283 vs 300), with full 5-layer recordings
VisualWebArena	Visual web tasks	Self-hosted (3 sites)	910	Real websites with full visual layer (vs 3 hosted apps)
WebVoyager	Real-website nav	Real websites (15)	643	Interception-graded vs LLM-judge-only, 144 sites covered
TheAgentCompany	Office workflows	Self-hosted (6 platforms)	175	Consumer everyday tasks instead of enterprise sandbox

ClawBench's niche: live consumer websites, everyday tasks, end-to-end recording. If you want a controlled sandbox or replayed traces, the projects above are excellent. If you want to know whether your agent can actually order food or book a flight today, this is the benchmark for that.

Architecture

Container internals

┌─────────────────────────────────────────────────┐
│  Container (Docker / Podman)                    │
│                                                 │
│  ┌───────────┐   DOM events  ┌──────────────┐   │
│  │ content.js├──────────────►│ background.js│   │
│  │ (per tab) │               │  (service    │   │
│  └───────────┘               │   worker)    │   │
│                              └──┬──────┬────┘   │
│                                 │      │        │
│                         actions │      │ screenshots
│                                 │      │        │
│  ┌──────────┐            ┌──────▼──────▼────┐   │
│  │  Xvfb    │◄──ffmpeg──►│  FastAPI Server  │   │
│  │ :99      │  x11grab   │  :7878           │   │
│  └──────────┘            └──────────────────┘   │
│                                  │              │
│  ┌──────────┐            ┌───────▼─────────┐    │
│  │ Chromium │            │     /data       │    │
│  │ :9222 CDP│            │  actions.jsonl  │    │
│  └──────────┘            │  requests.jsonl │    │
│                          │  screenshots/   │    │
│                          │  recording.mp4  │    │
│                          └─────────────────┘    │
└─────────────────────────────────────────────────┘

CLI

# Interactive TUI (recommended):
./run.sh

# Single run:
uv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6

# Human mode (you control the browser via noVNC):
uv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats --human

# Batch (all models x cases 1-50, 3 concurrent):
uv run clawbench-batch --all-models --case-range 1-50 --max-concurrent 3

# Batch all V1 tasks from test-cases/v1/:
uv run clawbench-batch --models claude-sonnet-4-6 --all-cases --max-concurrent 3

# Batch all V2 tasks from test-cases/v2/:
uv run clawbench-batch --models claude-sonnet-4-6 --cases-suite v2 --all-cases --max-concurrent 3

# Batch converted Claw-Eval tasks from test-cases/claw-eval/:
uv run clawbench-batch --models claude-sonnet-4-6 --cases-suite claw-eval --all-cases

# Batch a custom case directory:
uv run clawbench-batch --models claude-sonnet-4-6 --cases-dir custom-cases --all-cases

V1 tasks are in test-cases/v1/ (153 tasks). V2 tasks are in test-cases/v2/ (130 tasks), Lite is in test-cases/v1-lite/ (20 tasks), and converted Claw-Eval tasks live in test-cases/claw-eval/ (19 tasks). All suites use test-cases/task.schema.json. For test case authoring details, see CONTRIBUTING.md. For output structure and evaluation guidance, see eval/README.md.

Evaluation

Evaluation is a post-session step -- first run agents to collect trajectories, then evaluate them against human reference runs.

 1. Run agents (root uv package)   2. Evaluate (eval/)
 ─────────────────────────         ────────────────────────────────
 ./run.sh / clawbench-batch ──►    Claude Code subagents compare
 produces test-output/             agent vs human trajectories
   with 5-layer recordings         under eval/agentic_eval.md rubric

The evaluator compares each agent trajectory against a human reference trajectory across all five recording layers (video, screenshots, HTTP traffic, browser actions, agent messages), then outputs PASS/FAIL with evidence-backed justification.

See eval/README.md for the full evaluation guide and Claude Code prompt template.

FAQ

What data does each run produce?

Each session records five layers of synchronized data under /data/:

Layer	File	Description
Session replay	`recording.mp4`	Full session video (H.264, 15fps)
Action screenshots	`screenshots/*.png`	Timestamped PNG per browser action
Browser actions	`actions.jsonl`	Every DOM event (click, keydown, input, pageLoad, scroll, etc.)
HTTP traffic	`requests.jsonl`	Every HTTP request with headers, body, and query params
Agent messages	`agent-messages.jsonl`	Full agent conversation transcript (thinking, text, tool calls)

For the Pi harness, agent-messages.jsonl is filtered Pi JSON mode output, including message_start/message_end events, tool_execution_* events, tool-call content blocks, and thinking blocks when the selected model emits reasoning. Streaming message_update fragments, including *_delta rows, are omitted because complete assistant messages are already preserved in message_end events.

Harness diagnostic logs such as Pi's agent.log and proxy.log are not copied into the final data/ directory.

The interceptor result is saved to interception.json.
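A quick way to sanity-check a run directory is to count the records in each layer. The sketch below relies only on the file names listed in the table above and makes no assumptions about the fields inside each JSON record; `summarize_run` is a hypothetical helper, not a ClawBench command.

```python
import json
from pathlib import Path

JSONL_LAYERS = ("actions.jsonl", "requests.jsonl", "agent-messages.jsonl")


def summarize_run(data_dir):
    """Count records per JSONL layer and note which other artifacts exist."""
    data_dir = Path(data_dir)
    summary = {}
    for name in JSONL_LAYERS:
        path = data_dir / name
        if not path.is_file():
            summary[name] = None  # layer missing from this run
            continue
        with path.open(encoding="utf-8") as fh:
            # Parse every non-empty line so malformed JSON fails loudly.
            summary[name] = len([json.loads(line) for line in fh if line.strip()])
    summary["screenshots"] = len(list((data_dir / "screenshots").glob("*.png")))
    summary["recording.mp4"] = (data_dir / "recording.mp4").is_file()
    summary["interception.json"] = (data_dir / "interception.json").is_file()
    return summary
```

A `None` count or a missing recording is usually the first hint that a run was cut short before all layers were flushed.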

How does the request interceptor work?

The interceptor blocks critical, irreversible HTTP requests (checkout, form submit, email send) to prevent real-world side effects. It connects to Chrome via CDP's Fetch domain and matches requests against the eval schema (url_pattern regex + method + optional body/params). When triggered, it saves the blocked request to interception.json, kills the agent, and stops recording.

The interceptor does not validate task completion -- evaluation is handled separately by evaluators post-session.

For tasks behind payment walls (agent has no valid credit card), the eval schema uses a placeholder pattern that never matches, so the session runs until timeout.
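To make the matching rule concrete, here is a minimal sketch of a URL-and-method check against an eval schema. It is not the interceptor's implementation: the function name is hypothetical, using `re.search` (rather than a fully anchored match) is an assumption, the `/api/checkout` pattern is invented, and the optional body/params matching is omitted.

```python
import re


def matches_schema(schema, method, url):
    """Return True if a request matches the eval schema's method and URL regex."""
    if method.upper() != schema["method"].upper():
        return False
    # re.search is an assumption; the real interceptor may anchor differently.
    return re.search(schema["url_pattern"], url) is not None


# Sentinel from task 001: no realistic URL contains this string.
placeholder = {"url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__", "method": "POST"}
# Hypothetical pattern for illustration only.
checkout = {"url_pattern": r"/api/checkout", "method": "POST"}

print(matches_schema(placeholder, "POST", "https://www.ubereats.com/api/checkout"))
print(matches_schema(checkout, "POST", "https://example.com/api/checkout?step=final"))
```

With the placeholder pattern the check never succeeds, which is why such sessions run until the time limit and rely on the post-session judge.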

What is the synthetic user profile?

Each container gets a /my-info/ directory with a dummy user identity (Alex Green): personal info JSON, email credentials, and a resume PDF. The email is a fresh disposable PurelyMail address generated per run. The agent reads these files when it needs to fill forms, register accounts, etc.

Source templates: src/clawbench/runtime/shared/alex_green_personal_info.json (profile) and src/clawbench/runner/run_support/resume_template.json (resume).

Can I use Podman instead of Docker?

Yes. Set export CONTAINER_ENGINE=podman. The framework auto-detects whichever is available. Podman works without root privileges.

What tools can the agent use?

All supported harnesses run inside the same container recording and interception environment. CLI/MCP harnesses expose the browser tool plus a restricted set of read-only shell commands (ls, cat, find, grep, head, tail, jq, wc, etc.); commands that could bypass the browser (curl, python, node, wget) are blocked. Hermes and Pi use native browser/file tools attached to the same ClawBench Chrome CDP endpoint. The Pi harness intentionally allowlists only read-only file tools and browser interaction tools; bash, write, edit, browser_http_get, and browser_run_script are not enabled. The agent instruction also explicitly requires browser-only task completion.
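The allowlist idea can be illustrated with a deliberately conservative filter. This is a sketch, not the harness's actual policy code: the function name is hypothetical, the allowlist contains only the commands named above (the README's "etc." is not expanded), and rejecting every shell operator outright is a simplifying assumption.

```python
import shlex

READ_ONLY = {"ls", "cat", "find", "grep", "head", "tail", "jq", "wc"}
SHELL_META = set(";|&<>$`()\n")


def is_allowed(command_line):
    """Conservatively allow only a single read-only command with no shell operators."""
    if SHELL_META & set(command_line):
        return False  # no pipes, chaining, redirection, or substitution
    try:
        argv = shlex.split(command_line)
    except ValueError:  # unbalanced quotes
        return False
    return bool(argv) and argv[0] in READ_ONLY


print(is_allowed("cat /my-info/alex_green_personal_info.json"))
print(is_allowed("curl https://example.com"))
```

The sketch does not inspect arguments, so a production filter would also need to handle cases such as `find -exec`, which can launch other programs.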

How do I add a new test case?

See CONTRIBUTING.md. In short: create a directory under the target corpus (test-cases/v1/ for V1 or test-cases/v2/ for V2) with a task.json conforming to test-cases/task.schema.json, define the eval schema, test with human mode, and submit a PR.

Contributing

We welcome contributions -- especially new test cases. If you've ever ordered groceries, booked an appointment, or filed a form online, you already know how to write one. Most PRs are a single JSON file and land in under a day.

Quick wins:

Add a new test case (~30 min, no container expertise needed)
Add a new category of 10+ tasks → co-author invitation on the next paper revision
Submit a new model to the public leaderboard
Browse good first issues

See CONTRIBUTING.md for the full guide and contributor recognition policy.

Community

Come hang out with researchers, builders, and contributors working on real-world browser agents.

_{English community
Agent builders, researchers, contributors}

_{Chinese community
Researchers, developers, and contributors}

_{Async Q&A
Searchable, long-form, permanent}

Use the Discord and GitHub Discussions links for ongoing community support. For the WeChat group (微信群), use the QR link above.

Frequently Asked Questions

What is ClawBench? ClawBench is an open-source benchmark for AI browser agents — the systems (GPT-based, Claude-based, or open) that drive a real web browser to complete a user's task. V1 measures whether the agent actually finishes 153 everyday online tasks across 144 live websites; V2 adds a 130-task corpus in test-cases/v2/. It measures completion, not whether the agent produces the right-looking text.

What kinds of tasks does ClawBench cover? Fifteen life categories: food delivery, travel booking, job applications, shopping, housing search, email and calendar management, academic research, software development, learning platforms, and more. Every task is something a normal person might do in a normal week, on a real website.

Are 153 tasks enough for evaluation? Yes for a V1 benchmark signal: the 153 tasks span 144 live websites and 15 life categories, and each full run is expensive because it uses isolated containers, real websites, five-layer recording, and post-session judgment against human references. V2 adds another 130 tasks in test-cases/v2/. For cheaper iteration, start with the 20-task test-cases/v1-lite/ subset.

How is a task judged successful? Each task runs in an isolated browser container with a five-layer recording: video, screenshots, network requests, browser actions, and agent messages. For the original V1 results, an evaluator compares the agent trajectory against human reference runs and assigns PASS/FAIL with evidence from the recording. For V2 and newer leaderboard rows, scoring is two-stage: first, the request interceptor checks whether the final blocked HTTP request matches the task's URL/method schema; second, an LLM judge checks whether the captured request payload fulfills the natural-language instruction.

How do account login, registration, and initial task state work? Each run receives a synthetic user profile plus a fresh disposable PurelyMail address. If a task requires sign-up, the agent normally starts from scratch and registers during the run, using the provided identity and email. If a task needs starting files or workspace context, those files live under the task's extra_info/ directory and are mounted for the agent at runtime.

What happens when live websites change? Live-site change is part of the benchmark's target: ClawBench measures whether agents can handle production websites rather than frozen snapshots. That also means some runs can be affected by layout changes, availability, anti-bot systems, or alternate flows. Reproducibility comes from publishing task definitions, eval schemas, run metadata, and five-layer traces; repeated runs over time are still useful for measuring site drift.

Do CAPTCHA or bot checks dominate failures? If an agent encounters a CAPTCHA, it must attempt it. We have seen cases where frontier models are able to solve some CAPTCHAs. CAPTCHA failures can reflect model behavior, browser-control stack limits, or site defenses. The trace datasets make these failures inspectable.

What's the current top score? 33.3% — roughly one task in three — from the strongest frontier model we evaluated. The majority of tasks still defeat every model we've tested; the headroom is real, and the benchmark is not saturated.

Which harness are the published model results based on? The repo default is openclaw, but leaderboard rows include their harness explicitly. V1 results used OpenClaw; newer runs may use Hermes or other supported harnesses. Use the harness column when comparing models, because model and harness changes are separate experimental axes.

Is ClawBench tightly coupled to OpenClaw? No. OpenClaw is the default harness, but ClawBench supports interchangeable harnesses listed in src/clawbench/runtime/harnesses/harnesses.yaml.

Can ClawBench evaluate CLI agents? Yes. ClawBench is a browser-task benchmark, but CLI and coding-agent harnesses can drive the same instrumented Chromium session using native tools or MCPs.

How do I reproduce a published score? From a source checkout, configure models/models.yaml, then run uv run clawbench. The TUI builds the container image and runs local tasks against your model of choice. For batch runs, use --all-cases for the default V1 suite, --cases-suite v2 --all-cases for V2, or --cases-suite v1-lite --all-cases for Lite.

Will newer models be added? Yes. New model runs can be submitted or requested through the contribution flow and issues. Public rows are added as complete or clearly marked partial runs, depending on what has finished.

Is ClawBench safe to run against live websites? The runner uses a hardened container with a request interceptor that blocks purchases, account creation, outbound email sends, and similar irreversible actions by default. Tasks that need to simulate those actions (e.g., "add to cart and checkout") terminate at the last reversible step. You can relax the interceptor per-task if your research requires it.
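To make the blocking behavior concrete, here is a small sketch of how an interceptor might decide whether a request proceeds. The `BlockRule` and `Interceptor` names, the substring-based matching, and the example URLs are hypothetical and chosen only for illustration; the real interceptor's rule format is not described in this README.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BlockRule:
    """Hypothetical rule: block requests with this method whose URL contains the substring."""
    method: str
    url_substring: str


@dataclass
class Interceptor:
    """Illustrative stand-in for a request interceptor with a block list."""
    rules: List[BlockRule]
    blocked: List[Tuple[str, str]] = field(default_factory=list)

    def allow(self, method: str, url: str) -> bool:
        """Return True if the request may proceed; otherwise record it and block."""
        for rule in self.rules:
            if method.upper() == rule.method.upper() and rule.url_substring in url:
                # The blocked request is kept so it can be inspected later.
                self.blocked.append((method.upper(), url))
                return False
        return True


# Reversible steps pass through; the irreversible submit is stopped and recorded.
interceptor = Interceptor(rules=[BlockRule("POST", "/checkout/submit")])
cart_ok = interceptor.allow("POST", "https://shop.example.com/cart/add")
submit_ok = interceptor.allow("POST", "https://shop.example.com/checkout/submit")
```

Under these assumptions the agent can add items to a cart freely, while the final submit never reaches the site and remains available in `interceptor.blocked` for scoring.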

Can I contribute new tasks or harnesses? Yes. V1 tasks live in test-cases/v1/; V2 tasks live in test-cases/v2/; Lite tasks live in test-cases/v1-lite/. Harness definitions live in src/clawbench/runtime/harnesses/harnesses.yaml. See CONTRIBUTING.md for the task schema and validation flow.

How does ClawBench relate to HarnessBench? Same scoring pipeline, orthogonal axis. ClawBench fixes the harness and varies the model; HarnessBench fixes the model and varies the harness. They share the V1 153-task corpus, the five-layer recording, and the agentic evaluator — so numbers are directly comparable.

Citation

If you use ClawBench in your research, please cite:

@misc{zhang2026clawbenchaiagentscomplete,
  title         = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author        = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year          = {2026},
  eprint        = {2604.08523},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.08523}
}

Contact

Questions, suggestions, or research collaboration? Reach the maintainer:

Yuxuan Zhang — reacher [at] cs.ubc.ca (UBC, NAIL Group) · Homepage ↗
For bug reports or feature requests, please open a GitHub issue — it's faster than email and gets seen by all maintainers.

Core Contributors

_{Yuxuan Zhang}

_{Yubo Wang}

_{Perry Zhu}

_{Penghui Du}

_{Junwen Miao}

Advisors

_{Kelsey R. Allen}

_{Wenhu Chen}

_{Dongfu Jiang}

_{Liang Chen}

Support ClawBench

If ClawBench is useful for your research or product work, the single most helpful thing you can do is star the repo — it surfaces the benchmark to other AI-agent researchers and helps us justify continued dataset curation.

Contributions are welcome: new test cases, bug fixes, or evaluation submissions for a model we haven't scored yet. See CONTRIBUTING.md.

License & Acknowledgments

Apache 2.0 -- see LICENSE.

The converted Claw-Eval suite in test-cases/claw-eval/ is derived from claw-eval/claw-eval and the claw-eval/Claw-Eval dataset, which are released under the MIT License. Third-party package notices are in NOTICE.

Built with OpenClaw, opencode, Claude Code, the Claude in Chrome extension, OpenAI Codex CLI, browser-use, claw-code, Hermes Agent, and Pi with pi-browser-harness (selectable harnesses), Microsoft Playwright MCP (browser control bridge for the opencode, claude-code, codex, and claw-code harnesses), LiteLLM (API translation proxy for the claude-code, claude-code-chrome-extension, codex, browser-use, claw-code, and pi harnesses), noVNC (MPL 2.0), and websockify (LGPL 3.0).

