A curated list of agent benchmarks and evaluation frameworks.
- Coding (9)
- Computer Use (14)
- Deep Research (6)
- Embodied (4)
- General Capabilities (17)
- Memory (7)
- Safety (6)
- Scientific Research (3)
- Security (8)
- Web (12)
- Commit0 (Dec 2024) 📄 30 ⭐ 190
Paper · Github · Website
Evaluates AI agents on writing complete software libraries from scratch given specification documents and interactive unit test suites.
- LoCoBench-Agent (Nov 2025) 📄 7 ⭐ 19
Paper · Github
Evaluates LLM agents in realistic long-context software engineering workflows, testing multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across context lengths from 10K to 1M tokens.
- MLE-bench (Oct 2024) 📄 200 ⭐ 1.5k
Paper · Github · Website
Measures agents on machine learning engineering using 75 competitions from Kaggle, testing skills including model training, dataset preparation, and experiment execution.
- 🆕 ProjDevBench (Feb 2026) 📄 4 ⭐ 17
Paper · Github
Evaluates coding agents on end-to-end project development across 20 programming problems in 8 categories, assessing system architecture design, functional correctness, and iterative solution refinement.
- SciCode (Jul 2024) 📄 102 ⭐ 191
Paper · Github
Evaluates agents on generating code to solve 80 real scientific research problems decomposed into 338 subproblems across 16 natural science fields including math, physics, chemistry, biology, and materials science.
- SWE-bench (Oct 2023) 📄 2.0k ⭐ 4.8k
Paper · Github · Website
Evaluates agents' ability to resolve real GitHub issues by editing codebases, drawn from 2,294 issues across 12 Python repositories.
- SWE-EVO (Dec 2025) 📄 10
Paper
Evaluates coding agents on 48 long-horizon software evolution tasks requiring multi-step modifications spanning an average of 21 files, drawn from the version histories of 7 mature open-source Python projects.
- SWE-rebench (May 2025) 📄 52
Paper · Website
A continuously refreshed benchmark of 21,000+ interactive Python-based software engineering tasks extracted from GitHub repositories, designed to prevent evaluation contamination.
- USACO (Apr 2024) 📄 67 ⭐ 124
Paper · Github · Website
Evaluates agents on 307 competitive programming problems from the USA Computing Olympiad, requiring complex algorithmic reasoning, puzzle solving, and efficient code generation.
- AndroidWorld (May 2024) 📄 261 ⭐ 744
Paper · Github
A dynamic benchmarking environment for autonomous agents covering 116 programmatic tasks across 20 real-world Android apps, with tasks parameterized and expressed in natural language.
- CRAB (Jul 2024) 📄 34 ⭐ 416
Paper · Github · Website
Evaluates multimodal agents on 120 tasks across computer desktop and mobile phone environments using a graph-based fine-grained evaluation method.
- MobileWorld (Dec 2025) 📄 17
Paper
Evaluates autonomous mobile agents on 201 tasks across 20 applications in agent-user interactive and MCP-augmented environments, emphasizing long-horizon cross-application workflows.
- OmniACT (Feb 2024) 📄 136
Paper · Dataset
Evaluates agents on generating executable programs to complete computer tasks across desktop and web applications, using both text and visual UI understanding.
- OSUniverse (May 2025) 📄 7 ⭐ 24
Paper · Github · Website
Evaluates agents on complex multimodal desktop tasks of increasing difficulty, from basic UI interactions to multi-step workflows requiring spatial reasoning and dexterity.
- OSWorld (Apr 2024) 📄 593 ⭐ 2.8k
Paper · Github · Website
Evaluates multimodal agents on 369 real-world computer tasks across OS, web, and desktop apps.
- OSWorld-MCP (Oct 2025) 📄 8 ⭐ 222
Paper · Github · Website
Evaluates computer-use agents on 158 tasks combining MCP tool invocation and GUI operations across 7 common applications in a real-world environment.
- SCUBA (Sep 2025) 📄 4 ⭐ 8
Paper · Github · Website
Evaluates computer-use agents on CRM workflows within Salesforce across 300 tasks spanning three personas - platform administrators, sales representatives, and service agents.
- Spider2-V (Jul 2024) 📄 50 ⭐ 151
Paper · Github · Website
Evaluates multimodal agents on 494 real-world data science and engineering tasks across 20 enterprise applications, requiring SQL queries, Python code, and GUI operations.
- Terminal-Bench 1.0 ⭐ 2.1k
Github · Website
Evaluates agents' ability to use a computer terminal on ~100 tasks.
- Terminal-Bench 2.0 (Jan 2026) 📄 67 ⭐ 207
Paper · Github · Website · Agentbeats
A curated benchmark of 89 hard tasks in computer terminal environments, spanning software engineering, ML, security, and data science.
- UI-CUBE (Nov 2025) 📄 2
Paper · Github · Website
Evaluates enterprise computer-use agents on 226 tasks spanning simple UI interactions and complex workflows in enterprise applications, with multi-resolution testing and automated validation.
- WorkArena (Mar 2024) 📄 179 ⭐ 245
Paper · Github · Website
Evaluates agents on 33 knowledge worker tasks using the ServiceNow enterprise platform, measuring ability to handle realistic workplace software workflows.
- WorkArena++ (Jul 2024) 📄 45 ⭐ 245
Paper · Github · Website
Evaluates LLMs and vision-language models on 682 realistic enterprise workflow tasks requiring planning, problem-solving, logical reasoning, retrieval, and contextual understanding within ServiceNow.
- BrowseComp-Plus (Aug 2025) 📄 86 ⭐ 255
Paper · Github · Website
Evaluates deep-research agents using a fixed curated document corpus, enabling controlled experiments on retrieval methods, citation accuracy, and context engineering.
- DeepResearch Bench (Jun 2025) 📄 124 ⭐ 697
Paper · Github · Website
Evaluates deep research agents on 100 PhD-level research tasks across 22 fields crafted by domain experts, assessing report quality and information retrieval effectiveness.
- DeepScholar-Bench (Aug 2025) 📄 18 ⭐ 237
Paper · Github · Website
A live benchmark for generative research synthesis, evaluating systems that retrieve from the live web and produce long-form cited reports, assessed on knowledge synthesis, retrieval quality, and verifiability.
- MMDeepResearch-Bench (Jan 2026) 📄 6 ⭐ 27
Paper · Github · Website
Evaluates multimodal deep research agents on 140 expert-crafted tasks across 21 domains, assessing citation-rich report synthesis where models must connect visual artifacts to sourced claims.
- PaperArena (Oct 2025) 📄 6 ⭐ 15
Paper · Github · Website
Evaluates LLM-based agents on questions requiring multi-paper information integration using external tools, with agents achieving only 38.78% average accuracy.
- WideSearch (Aug 2025) 📄 34 ⭐ 132
Paper · Github · Website
Evaluates agent reliability on large-scale information collection through 200 manually curated questions across 15+ domains, requiring agents to gather and organize atomic, verifiable information.
- ALFWorld (Oct 2020) 📄 791 ⭐ 724
Paper · Github · Website
A benchmark and simulator bridging abstract text-based planning with embodied visual task execution, enabling agents to learn policies in TextWorld and transfer them to ALFRED visual environments.
- BEHAVIOR-1K (Mar 2024) 📄 119 ⭐ 1.4k
Paper · Github · Website
A comprehensive simulation benchmark for human-centered robotics agents covering 1,000 everyday activities across 50 realistic scenes with over 9,000 annotated objects.
- MineAnyBuild (May 2025) 📄 1 ⭐ 14
Paper · Github
Evaluates agents on spatial planning in Minecraft across 4,000 curated tasks covering spatial understanding, reasoning, creativity, and commonsense, requiring generation of executable architecture building plans.
- Robotouille (Feb 2025) 📄 25 ⭐ 40
Paper · Github · Website
Evaluates LLM agents on long-horizon asynchronous kitchen planning scenarios, testing their capacity to manage overlapping tasks, interruptions, and parallel execution.
- AgencyBench (Jan 2026) 📄 7 ⭐ 77
Paper · Github · Website
Evaluates autonomous agents on 6 core agentic capabilities across 32 real-world scenarios comprising 138 tasks, each requiring approximately 90 tool calls and 1 million tokens.
- API-Bank (Apr 2023) 📄 400
Paper · Github
Evaluates LLMs on planning, retrieving, and calling APIs across 73 tools and 314 annotated tool-use dialogues containing 753 API calls.
- AppWorld (Jul 2024) 📄 144 ⭐ 406
Paper · Github · Website
Evaluates autonomous agents on 750 tasks requiring rich interactive code generation across 9 everyday applications accessible through 457 APIs, testing multi-app coordination and avoidance of unintended side effects.
- CRMArena (Nov 2024) 📄 43 ⭐ 136
Paper · Github
Evaluates AI agents on realistic customer service tasks across 9 task types and 3 personas using 16 industrial CRM objects with interconnected data simulating enterprise environments.
- DeepPlanning (Jan 2026) 📄 9
Paper · Website
Evaluates LLM agents on long-horizon planning through multi-day travel planning and shopping tasks requiring proactive information acquisition, local constrained reasoning, and global constrained optimization.
- GAIA (Nov 2023) 📄 685
Paper · Website
Evaluates general AI assistants on 466 real-world questions requiring reasoning, multi-modality handling, web browsing, and tool-use proficiency, with humans scoring 92% vs. 15% for GPT-4 with plugins.
- Gaia2 (Sep 2025) 📄 16 ⭐ 479
Paper · Github · Website
Measures general agent capabilities in dynamic, asynchronous environments requiring agents to handle ambiguities, adapt to noise, collaborate with other agents, and operate under temporal constraints.
- GTA (Jul 2024) 📄 69 ⭐ 144
Paper · Github · Website
Evaluates LLMs on real-world tool use with 229 tasks using human-written queries, real deployed tools, and authentic multimodal inputs such as images, screenshots, and code files.
- HCAST (Mar 2025) 📄 9 ⭐ 20
Paper · Github
Evaluates AI agents on 189 tasks across ML engineering, cybersecurity, software engineering, and general reasoning, using human baselines collected over 1,500+ hours to calibrate difficulty.
- MCP-Bench (Aug 2025) 📄 47 ⭐ 475
Paper · Github · Website
Evaluates LLMs on realistic multi-step tasks requiring tool use and cross-tool coordination through 28 MCP servers spanning 250 tools across finance, travel, scientific computing, and academic search.
- MCP-Universe (Aug 2025) 📄 64 ⭐ 583
Paper · Github · Website
Evaluates LLMs on realistic tasks through 11 real-world MCP servers across 6 domains including location navigation, repository management, financial analysis, 3D design, browser automation, and web search.
- REALM-Bench (Feb 2025) 📄 7 ⭐ 38
Paper · Github
Evaluates multi-agent systems on 14 progressively complex real-world planning and scheduling problems featuring multi-agent coordination, inter-dependencies, and dynamic disruptions.
- tau-bench (Jun 2024) 📄 483 ⭐ 1.2k
Paper · Github
Evaluates agents on dynamic conversations with a simulated user in real-world domains, providing domain-specific API tools and policy guidelines across retail and airline scenarios.
- Tool Decathlon (Oct 2025) 📄 28 ⭐ 336
Paper · Github · Website
Evaluates language agents across 108 tasks spanning 32 software applications and 604 tools, requiring multi-app workflows over approximately 20 interaction turns.
- ToolComp (Jan 2025) 📄 7 ⭐ 4
Paper · Github
Evaluates multi-step tool-use reasoning with process supervision labels, enabling assessment of both final outcomes and intermediate reasoning steps.
- UltraHorizon (Sep 2025) 📄 11 ⭐ 22
Paper · Github
Measures foundational agent capabilities on long-horizon, partially observable tasks, with standard configurations involving 35K+ tokens and 60+ tool calls across exploration environments.
- τ²-bench (Jun 2025) 📄 192 ⭐ 1.1k
Paper · Github · Website · Agentbeats
Evaluates agents in dual-control settings where both the agent and the user modify shared state, exposing coordination and communication failures.
- Evo-Memory (Nov 2025) 📄 44
Paper
A streaming benchmark for evaluating self-evolving memory in LLM agents across 10 diverse datasets, requiring agents to search, adapt, and update memory after each interaction.
- LoCoMo (Feb 2024) 📄 379 ⭐ 805
Paper · Github · Website
Benchmarks long-term memory in language models across question answering, event summarization, and multimodal dialogue tasks using conversations spanning up to 35 sessions and 9K tokens on average.
- 🆕 LoCoMo-Plus (Feb 2026) 📄 1 ⭐ 20
Paper · Github
Benchmarks cognitive memory in LLM agents under cue-trigger semantic disconnect, testing agents on retaining and applying implicit constraints across long conversational contexts.
- LongMemEval (Oct 2024) 📄 225 ⭐ 720
Paper · Github · Website
Benchmarks chat assistants on five long-term memory abilities - information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention - using 500 questions embedded in scalable chat histories.
- Mem2ActBench (Jan 2026) 📄 2
Paper
Evaluates whether agents can proactively leverage long-term memory to execute tool-based actions, covering 2,029 sessions across 400 tool-use tasks.
- 🆕 MemoryArena (Feb 2026) 📄 9
Paper · Website
Evaluates agent memory in multi-session loops across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.
- MEMTRACK (Oct 2025) 📄 6
Paper
Evaluates long-term memory and state tracking in multi-platform agent environments, testing memory acquisition, selection, and conflict resolution across realistic organizational workflows spanning Slack, Linear, and Git.
- Agent-SafetyBench (Dec 2024) 📄 149 ⭐ 132
Paper · Github · Dataset
Evaluates the safety of LLM agents across 8 categories of safety risks and 10 failure modes, covering 349 interaction environments and 2,000 test cases.
- AgentAuditor (May 2025) 📄 31 ⭐ 3
Paper · Github
Benchmarks LLM-based evaluators on detecting safety risks and security threats in agent interactions, covering 2,293 annotated records across 15 risk types and 29 application scenarios.
- MCP-SafetyBench (Dec 2025) 📄 14 ⭐ 17
Paper · Github · Website
Evaluates the safety of LLMs using the Model Context Protocol across 5 domains, with a taxonomy of 20 MCP attack types spanning server, host, and user sides.
- 🆕 MT-AgentRisk (Feb 2026) 📄 2 ⭐ 17
Paper · Github · Website
Evaluates multi-turn tool-using agent safety, revealing that attack success rates increase by approximately 16% in multi-turn settings compared to single-turn baselines.
- OpenAgentSafety (Jul 2025) 📄 30 ⭐ 31
Paper · Github · Dataset
Evaluates agent behavior across 8 critical risk categories using real tools including web browsers, code execution, file systems, and bash shells across 350+ multi-turn tasks with both benign and adversarial intents.
- OS-Harm (Jun 2025) 📄 43 ⭐ 64
Paper · Github
Evaluates the safety of computer-use agents on 150 tasks spanning three harm categories - deliberate user misuse, prompt injection attacks, and model misbehavior.
- AstaBench (Oct 2025) 📄 8 ⭐ 101
Paper · Github
Evaluates agents on 2,400+ problems spanning the full scientific discovery process across multiple domains, including literature review, experiment replication, data analysis, and research direction proposals.
- InnovatorBench (Oct 2025) 📄 7 ⭐ 17
Paper · Github · Website
Evaluates agents on 20 code-driven LLM research tasks spanning data construction, filtering, augmentation, and model design, paired with the ResearchGym execution environment.
- RE-Bench (Nov 2024) 📄 79 ⭐ 136
Paper · Github
Evaluates AI agents on 7 challenging open-ended ML research engineering tasks, benchmarking against human experts across 1,500+ hours of collected human baselines.
- Agent Security Bench (Oct 2024) 📄 182 ⭐ 227
Paper · Github · Website
Evaluates attacks on and defenses of LLM-based agents across 10 scenarios, 10 agent types, over 400 tools, and 27 attack and defense methods.
- CAIBench (Oct 2025) 📄 12
Paper · Github
A modular meta-benchmark for evaluating LLMs and agents across offensive and defensive cybersecurity domains, integrating over 10,000 instances across Jeopardy CTFs, Attack-Defense CTFs, Cyber Range exercises, and knowledge assessments.
- CVE-Bench (Mar 2025) 📄 59 ⭐ 207
Paper · Github
Evaluates LLM agents on exploiting real-world web application vulnerabilities based on critical-severity CVEs, with state-of-the-art agents resolving approximately 13% of vulnerabilities.
- CyberGym (Jun 2025) 📄 11 ⭐ 276
Paper · Github · Website
Evaluates agents on real-world vulnerability analysis by tasking them with generating proof-of-concept tests for known vulnerabilities given a natural language description and codebase. Covers 1,507 vulnerabilities across 188 open-source projects.
- DoomArena (Apr 2025) 📄 22 ⭐ 57
Paper · Github
A security evaluation framework for AI agents that tests vulnerabilities across configurable threat models, including malicious-user and malicious-environment scenarios, integrated with BrowserGym and tau-bench.
- ExCyTIn-Bench (Jul 2025) 📄 8 ⭐ 121
Paper · Github · Website
Evaluates LLM agents on cyber threat investigation through 589 security questions derived from investigation graphs built from 8 simulated attacks across 57 log tables in an Azure environment.
- SEC-bench (Jun 2025) 📄 26 ⭐ 70
Paper · Github
Evaluates LLM agents on authentic security engineering tasks - proof-of-concept generation and vulnerability patching - using an automated pipeline that constructs repositories, reproduces vulnerabilities, and generates gold patches.
- WASP (Apr 2025) 📄 73 ⭐ 83
Paper · Github
Evaluates end-to-end security of web agents against prompt injection attacks in realistic scenarios, with attacks partially succeeding in up to 86% of cases.
- AssistantBench (Jul 2024) 📄 105 ⭐ 70
Paper · Github · Website
Evaluates web agents on 214 realistic, time-consuming tasks that can be automatically evaluated across different scenarios and domains, with no model exceeding 26% accuracy.
- BrowseComp (Apr 2025) 📄 361 ⭐ 4.5k
Paper · Github
Evaluates web browsing agents on 1,266 questions requiring persistent navigation to find entangled, hard-to-locate information with short, verifiable answers.
- BrowserArena (Oct 2025) 📄 8 ⭐ 4
Paper · Github
A live open-web agent evaluation platform that collects user-submitted tasks and runs Arena-style head-to-head comparisons with step-level human feedback for LLM web agents.
- Mind2Web (Jun 2023) 📄 988 ⭐ 984
Paper · Github · Website
Evaluates generalist web agents on 2,000+ tasks across 137 websites and 31 domains.
- Mind2Web 2 (Jun 2025) 📄 34 ⭐ 109
Paper · Github · Website
Evaluates agentic search systems on 130 long-horizon tasks requiring real-time web browsing, information synthesis, and citation-backed answers, using an Agent-as-a-Judge framework.
- Online-Mind2Web (Apr 2025) 📄 90 ⭐ 171
Paper · Github · Leaderboard
Evaluates web agents on 300 diverse, realistic tasks spanning 136 websites under conditions that approximate real user settings.
- VisualWebArena (Jan 2024) ⭐ 464
Paper · Github · Website
Evaluates multimodal web agents on realistic, visually grounded web tasks, requiring agents to process image-text inputs and execute actions on websites.
- WebArena (Jul 2023) 📄 1.1k ⭐ 1.4k
Paper · Github · Website
Evaluates agents on realistic web tasks spanning e-commerce, forums, collaborative software development, and content management.
- 🆕 WebArena-Infinity (Mar 2026) ⭐ 41
Github · Website
Generates realistic browser environments with verifiable tasks at scale using a multi-agent pipeline that builds, audits, and hardens web applications from static artifacts such as user manuals and recorded workflows. The initial release includes 10 environments, 1,260 verifiable tasks, and 2,070 successful browser-agent trajectories.
- WebArena-Verified ⭐ 36
Github · Website
A curated, version-controlled set of 812 web agent tasks with deterministic evaluators supporting offline evaluation via network trace replay, eliminating LLM-based judging.
- WebChoreArena (Jun 2025) 📄 19 ⭐ 34
Paper · Github · Website
Evaluates web browsing agents on 532 tedious real-world tasks focusing on massive memory requirements, mathematical calculations, and long-term memory across multiple webpages.
- WebShop (Jul 2022) 📄 938 ⭐ 524
Paper · Github · Website
Evaluates agents on navigating and shopping in a simulated e-commerce environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions.