A curated list of agent benchmarks and evaluation frameworks.
- Coding (9)
- Computer Use (14)
- Deep Research (6)
- Embodied (4)
- General Capabilities (17)
- Memory (7)
- Safety (6)
- Scientific Research (3)
- Security (8)
- Web (12)
- Commit0 (Dec 2024) 📄 30 ⭐ 190
Paper · Github · Website
Evaluates AI agents on writing complete software libraries from scratch given specification documents and interactive unit test suites.
- LoCoBench-Agent (Nov 2025) 📄 7 ⭐ 19
Paper · Github
Evaluates LLM agents in realistic long-context software engineering workflows, testing multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across context lengths from 10K to 1M tokens.
- MLE-bench (Oct 2024) 📄 200 ⭐ 1.5k
Paper · Github · Website
Measures agents on machine learning engineering using 75 competitions from Kaggle, testing skills including model training, dataset preparation, and experiment execution.
- 🆕 ProjDevBench (Feb 2026) 📄 4 ⭐ 17
Paper · Github
Evaluates coding agents on end-to-end project development across 20 programming problems in 8 categories, assessing system architecture design, functional correctness, and iterative solution refinement.
- SciCode (Jul 2024) 📄 102 ⭐ 191
Paper · Github
Evaluates agents on generating code to solve 80 real scientific research problems decomposed into 338 subproblems across 16 natural science fields including math, physics, chemistry, biology, and materials science.
- SWE-bench (Oct 2023) 📄 2.0k ⭐ 4.8k
Paper · Github · Website
Evaluates agents' ability to resolve real GitHub issues by editing codebases, drawn from 2,294 issues across 12 Python repositories.
- SWE-EVO (Dec 2025) 📄 10
Paper
Evaluates coding agents on 48 long-horizon software evolution tasks requiring multi-step modifications spanning an average of 21 files, drawn from the version histories of 7 mature open-source Python projects.
- SWE-rebench (May 2025) 📄 52
Paper · Website
A continuously refreshed benchmark of 21,000+ interactive Python-based software engineering tasks extracted from GitHub repositories, designed to prevent evaluation contamination.
- USACO (Apr 2024) 📄 67 ⭐ 124
Paper · Github · Website
Evaluates agents on 307 competitive programming problems from the USA Computing Olympiad, requiring complex algorithmic reasoning, puzzle solving, and efficient code generation.
- AndroidWorld (May 2024) 📄 261 ⭐ 744
Paper · Github
A dynamic benchmarking environment for autonomous agents covering 116 programmatic tasks across 20 real-world Android apps, with tasks parameterized and expressed in natural language.
- CRAB (Jul 2024) 📄 34 ⭐ 416
Paper · Github · Website
Evaluates multimodal agents on 120 tasks across computer desktop and mobile phone environments using a graph-based fine-grained evaluation method.
- MobileWorld (Dec 2025) 📄 17
Paper
Evaluates autonomous mobile agents on 201 tasks across 20 applications in agent-user interactive and MCP-augmented environments, emphasizing long-horizon cross-application workflows.
- OmniACT (Feb 2024) 📄 136
Paper · Dataset
Evaluates agents on generating executable programs to complete computer tasks across desktop and web applications, using both text and visual UI understanding.
- OSUniverse (May 2025) 📄 7 ⭐ 24
Paper · Github · Website
Evaluates agents on complex multimodal desktop tasks of increasing difficulty, from basic UI interactions to multi-step workflows requiring spatial reasoning and dexterity.
- OSWorld (Apr 2024) 📄 593 ⭐ 2.8k
Paper · Github · Website
Evaluates multimodal agents on 369 real-world computer tasks across OS, web, and desktop apps.
- OSWorld-MCP (Oct 2025) 📄 8 ⭐ 222
Paper · Github · Website
Evaluates computer-use agents on 158 tasks combining MCP tool invocation and GUI operations across 7 common applications in a real-world environment.
- SCUBA (Sep 2025) 📄 4 ⭐ 8
Paper · Github · Website
Evaluates computer-use agents on CRM workflows within Salesforce across 300 tasks spanning three personas - platform administrators, sales representatives, and service agents.
- Spider2-V (Jul 2024) 📄 50 ⭐ 151
Paper · Github · Website
Evaluates multimodal agents on 494 real-world data science and engineering tasks across 20 enterprise applications, requiring SQL queries, Python code, and GUI operations.
- Terminal-Bench 1.0 ⭐ 2.1k
Github · Website
Evaluates agents' ability to use a computer terminal on ~100 tasks.
- Terminal-Bench 2.0 (Jan 2026) 📄 67 ⭐ 207
Paper · Github · Website · Agentbeats
A curated benchmark of 89 hard tasks in computer terminal environments, spanning software engineering, ML, security, and data science.
- UI-CUBE (Nov 2025) 📄 2
Paper · Github · Website
Evaluates enterprise computer-use agents on 226 tasks spanning simple UI interactions and complex workflows in enterprise applications, with multi-resolution testing and automated validation.
- WorkArena (Mar 2024) 📄 179 ⭐ 245
Paper · Github · Website
Evaluates agents on 33 knowledge worker tasks using the ServiceNow enterprise platform, measuring ability to handle realistic workplace software workflows.
- WorkArena++ (Jul 2024) 📄 45 ⭐ 245
Paper · Github · Website
Evaluates LLMs and vision-language models on 682 realistic enterprise workflow tasks requiring planning, problem-solving, logical reasoning, retrieval, and contextual understanding within ServiceNow.
- BrowseComp-Plus (Aug 2025) 📄 86 ⭐ 255
Paper · Github · Website
Evaluates deep-research agents using a fixed curated document corpus, enabling controlled experiments on retrieval methods, citation accuracy, and context engineering.
- DeepResearch Bench (Jun 2025) 📄 124 ⭐ 697
Paper · Github · Website
Evaluates deep research agents on 100 PhD-level research tasks across 22 fields crafted by domain experts, assessing report quality and information retrieval effectiveness.
- DeepScholar-Bench (Aug 2025) 📄 18 ⭐ 237
Paper · Github · Website
A live benchmark for generative research synthesis, evaluating systems that retrieve from the live web and produce long-form cited reports, assessed on knowledge synthesis, retrieval quality, and verifiability.
- MMDeepResearch-Bench (Jan 2026) 📄 6 ⭐ 27
Paper · Github · Website
Evaluates multimodal deep research agents on 140 expert-crafted tasks across 21 domains, assessing citation-rich report synthesis where models must connect visual artifacts to sourced claims.
- PaperArena (Oct 2025) 📄 6 ⭐ 15
Paper · Github · Website
Evaluates LLM-based agents on questions requiring multi-paper information integration using external tools, with agents achieving only 38.78% average accuracy.
- WideSearch (Aug 2025) 📄 34 ⭐ 132
Paper · Github · Website
Evaluates agent reliability on large-scale information collection through 200 manually curated questions across 15+ domains, requiring agents to gather and organize atomic, verifiable information.
- ALFWorld (Oct 2020) 📄 791 ⭐ 724
Paper · Github · Website
A benchmark and simulator bridging abstract text-based planning with embodied visual task execution, enabling agents to learn policies in TextWorld and transfer them to ALFRED visual environments.
- BEHAVIOR-1K (Mar 2024) 📄 119 ⭐ 1.4k
Paper · Github · Website
A comprehensive simulation benchmark for human-centered robotics agents covering 1,000 everyday activities across 50 realistic scenes with over 9,000 annotated objects.
- MineAnyBuild (May 2025) 📄 1 ⭐ 14
Paper · Github
Evaluates agents on spatial planning in Minecraft across 4,000 curated tasks covering spatial understanding, reasoning, creativity, and commonsense, requiring generation of executable architecture building plans.
- Robotouille (Feb 2025) 📄 25 ⭐ 40
Paper · Github · Website
Evaluates LLM agents on long-horizon asynchronous kitchen planning scenarios, testing their capacity to manage overlapping tasks, interruptions, and parallel execution.
- AgencyBench (Jan 2026) 📄 7 ⭐ 77
Paper · Github · Website
Evaluates autonomous agents on 6 core agentic capabilities across 32 real-world scenarios comprising 138 tasks, each requiring approximately 90 tool calls and 1 million tokens.
- API-Bank (Apr 2023) 📄 400
Paper · Github
Evaluates LLMs on planning, retrieving, and calling APIs across 73 tools and 314 annotated tool-use dialogues containing 753 API calls.
- AppWorld (Jul 2024) 📄 144 ⭐ 406
Paper · Github · Website
Evaluates autonomous agents on 750 tasks requiring rich interactive code generation across 9 everyday applications accessible through 457 APIs, testing multi-app coordination and avoidance of unintended side effects.
- CRMArena (Nov 2024) 📄 43 ⭐ 136
Paper · Github
Evaluates AI agents on realistic customer service tasks across 9 task types and 3 personas using 16 industrial CRM objects with interconnected data simulating enterprise environments.
- DeepPlanning (Jan 2026) 📄 9
Paper · Website
Evaluates LLM agents on long-horizon planning through multi-day travel planning and shopping tasks requiring proactive information acquisition, local constrained reasoning, and global constrained optimization.
- GAIA (Nov 2023) 📄 685
Paper · Website
Evaluates general AI assistants on 466 real-world questions requiring reasoning, multi-modality handling, web browsing, and tool-use proficiency, with humans scoring 92% vs. 15% for GPT-4 with plugins.
- Gaia2 (Sep 2025) 📄 16 ⭐ 479
Paper · Github · Website
Measures general agent capabilities in dynamic, asynchronous environments requiring agents to handle ambiguities, adapt to noise, collaborate with other agents, and operate under temporal constraints.
- GTA (Jul 2024) 📄 69 ⭐ 144
Paper · Github · Website
Evaluates LLMs on real-world tool use with 229 tasks using human-written queries, real deployed tools, and authentic multimodal inputs such as images, screenshots, and code files.
- HCAST (Mar 2025) 📄 9 ⭐ 20
Paper · Github
Evaluates AI agents on 189 tasks across ML engineering, cybersecurity, software engineering, and general reasoning, using human baselines collected over 1,500+ hours to calibrate difficulty.
- MCP-Bench (Aug 2025) 📄 47 ⭐ 475
Paper · Github · Website
Evaluates LLMs on realistic multi-step tasks requiring tool use and cross-tool coordination through 28 MCP servers spanning 250 tools across finance, travel, scientific computing, and academic search.
- MCP-Universe (Aug 2025) 📄 64 ⭐ 583
Paper · Github · Website
Evaluates LLMs on realistic tasks through 11 real-world MCP servers across 6 domains including location navigation, repository management, financial analysis, 3D design, browser automation, and web search.
- REALM-Bench (Feb 2025) 📄 7 ⭐ 38
Paper · Github
Evaluates multi-agent systems on 14 progressively complex real-world planning and scheduling problems featuring multi-agent coordination, inter-dependencies, and dynamic disruptions.
- tau-bench (Jun 2024) 📄 483 ⭐ 1.2k
Paper · Github
Evaluates agents on dynamic conversations with a simulated user in real-world domains, providing domain-specific API tools and policy guidelines across retail and airline scenarios.
- Tool Decathlon (Oct 2025) 📄 28 ⭐ 336
Paper · Github · Website
Evaluates language agents across 108 tasks spanning 32 software applications and 604 tools, requiring multi-app workflows over approximately 20 interaction turns.
- ToolComp (Jan 2025) 📄 7 ⭐ 4
Paper · Github
Evaluates multi-step tool-use reasoning with process supervision labels, enabling assessment of both final outcomes and intermediate reasoning steps.
- UltraHorizon (Sep 2025) 📄 11 ⭐ 22
Paper · Github
Measures foundational agent capabilities on long-horizon, partially observable tasks, with standard configurations involving 35K+ tokens and 60+ tool calls across exploration environments.
- τ²-bench (Jun 2025) 📄 192 ⭐ 1.1k
Paper · Github · Website · Agentbeats
Evaluates agents in dual-control settings where both the agent and the user modify shared state, exposing coordination and communication failures.
- Evo-Memory (Nov 2025) 📄 44
Paper
A streaming benchmark for evaluating self-evolving memory in LLM agents across 10 diverse datasets, requiring agents to search, adapt, and update memory after each interaction.
- LoCoMo (Feb 2024) 📄 379 ⭐ 805
Paper · Github · Website
Benchmarks long-term memory in language models across question answering, event summarization, and multimodal dialogue tasks using conversations spanning up to 35 sessions and 9K tokens on average.
- 🆕 LoCoMo-Plus (Feb 2026) 📄 1 ⭐ 20
Paper · Github
Benchmarks cognitive memory in LLM agents under cue-trigger semantic disconnect, testing agents on retaining and applying implicit constraints across long conversational contexts.
- LongMemEval (Oct 2024) 📄 225 ⭐ 720
Paper · Github · Website
Benchmarks chat assistants on five long-term memory abilities - information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention - using 500 questions embedded in scalable chat histories.
- Mem2ActBench (Jan 2026) 📄 2
Paper
Evaluates whether agents can proactively leverage long-term memory to execute tool-based actions, covering 2,029 sessions across 400 tool-use tasks.
- 🆕 MemoryArena (Feb 2026) 📄 9
Paper · Website
Evaluates agent memory in multi-session loops across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.
- MEMTRACK (Oct 2025) 📄 6
Paper
Evaluates long-term memory and state tracking in multi-platform agent environments, testing memory acquisition, selection, and conflict resolution across realistic organizational workflows spanning Slack, Linear, and Git.
- Agent-SafetyBench (Dec 2024) 📄 149 ⭐ 132
Paper · Github · Dataset
Evaluates the safety of LLM agents across 8 categories of safety risks and 10 failure modes, covering 349 interaction environments and 2,000 test cases.
- AgentAuditor (May 2025) 📄 31 ⭐ 3
Paper · Github
Benchmarks LLM-based evaluators on detecting safety risks and security threats in agent interactions, covering 2,293 annotated records across 15 risk types and 29 application scenarios.
- MCP-SafetyBench (Dec 2025) 📄 14 ⭐ 17
Paper · Github · Website
Evaluates the safety of LLMs using the Model Context Protocol across 5 domains, with a taxonomy of 20 MCP attack types spanning server, host, and user sides.
- 🆕 MT-AgentRisk (Feb 2026) 📄 2 ⭐ 17
Paper · Github · Website
Evaluates multi-turn tool-using agent safety, revealing that attack success rates increase by approximately 16% in multi-turn settings compared to single-turn baselines.
- OpenAgentSafety (Jul 2025) 📄 30 ⭐ 31
Paper · Github · Dataset
Evaluates agent behavior across 8 critical risk categories using real tools including web browsers, code execution, file systems, and bash shells across 350+ multi-turn tasks with both benign and adversarial intents.
- OS-Harm (Jun 2025) 📄 43 ⭐ 64
Paper · Github
Evaluates the safety of computer-use agents on 150 tasks spanning three harm categories - deliberate user misuse, prompt injection attacks, and model misbehavior.
- AstaBench (Oct 2025) 📄 8 ⭐ 101
Paper · Github
Evaluates agents on 2,400+ problems spanning the full scientific discovery process across multiple domains, including literature review, experiment replication, data analysis, and research direction proposals.
- InnovatorBench (Oct 2025) 📄 7 ⭐ 17
Paper · Github · Website
Evaluates agents on 20 code-driven LLM research tasks spanning data construction, filtering, augmentation, and model design, paired with the ResearchGym execution environment.
- RE-Bench (Nov 2024) 📄 79 ⭐ 136
Paper · Github
Evaluates AI agents on 7 challenging open-ended ML research engineering tasks, benchmarking against human experts across 1,500+ hours of collected human baselines.
- Agent Security Bench (Oct 2024) 📄 182 ⭐ 227
Paper · Github · Website
Evaluates attacks on and defenses of LLM-based agents across 10 scenarios, 10 agent types, over 400 tools, and 27 attack and defense methods.
- CAIBench (Oct 2025) 📄 12
Paper · Github
A modular meta-benchmark for evaluating LLMs and agents across offensive and defensive cybersecurity domains, integrating over 10,000 instances across Jeopardy CTFs, Attack-Defense CTFs, Cyber Range exercises, and knowledge assessments.
- CVE-Bench (Mar 2025) 📄 59 ⭐ 207
Paper · Github
Evaluates LLM agents on exploiting real-world web application vulnerabilities based on critical-severity CVEs, with state-of-the-art agents resolving approximately 13% of vulnerabilities.
- CyberGym (Jun 2025) 📄 11 ⭐ 276
Paper · Github · Website
Evaluates agents on real-world vulnerability analysis by tasking them with generating proof-of-concept tests for known vulnerabilities given a natural language description and codebase. Covers 1,507 vulnerabilities across 188 open-source projects.
- DoomArena (Apr 2025) 📄 22 ⭐ 57
Paper · Github
A security evaluation framework for AI agents that tests vulnerabilities across configurable threat models, including malicious-user and malicious-environment scenarios, integrated with BrowserGym and tau-bench.
- ExCyTIn-Bench (Jul 2025) 📄 8 ⭐ 121
Paper · Github · Website
Evaluates LLM agents on cyber threat investigation through 589 security questions derived from investigation graphs built from 8 simulated attacks across 57 log tables in an Azure environment.
- SEC-bench (Jun 2025) 📄 26 ⭐ 70
Paper · Github
Evaluates LLM agents on authentic security engineering tasks - proof-of-concept generation and vulnerability patching - using an automated pipeline that constructs repositories, reproduces vulnerabilities, and generates gold patches.
- WASP (Apr 2025) 📄 73 ⭐ 83
Paper · Github
Evaluates end-to-end security of web agents against prompt injection attacks in realistic scenarios, with attacks partially succeeding in up to 86% of cases.
- AssistantBench (Jul 2024) 📄 105 ⭐ 70
Paper · Github · Website
Evaluates web agents on 214 realistic, time-consuming tasks that can be automatically evaluated across different scenarios and domains, with no model exceeding 26% accuracy.
- BrowseComp (Apr 2025) 📄 361 ⭐ 4.5k
Paper · Github
Evaluates web browsing agents on 1,266 questions requiring persistent navigation to find entangled, hard-to-locate information with short, verifiable answers.
- BrowserArena (Oct 2025) 📄 8 ⭐ 4
Paper · Github
A live open-web agent evaluation platform that collects user-submitted tasks and runs Arena-style head-to-head comparisons with step-level human feedback for LLM web agents.
- Mind2Web (Jun 2023) 📄 988 ⭐ 984
Paper · Github · Website
Evaluates generalist web agents on 2,000+ tasks across 137 websites and 31 domains.
- Mind2Web 2 (Jun 2025) 📄 34 ⭐ 109
Paper · Github · Website
Evaluates agentic search systems on 130 long-horizon tasks requiring real-time web browsing, information synthesis, and citation-backed answers, using an Agent-as-a-Judge framework.
- Online-Mind2Web (Apr 2025) 📄 90 ⭐ 171
Paper · Github · Leaderboard
Evaluates web agents on 300 diverse, realistic tasks spanning 136 websites under conditions that approximate real user settings.
- VisualWebArena (Jan 2024) ⭐ 464
Paper · Github · Website
Evaluates multimodal web agents on realistic, visually grounded web tasks, requiring agents to process image-text inputs and execute actions on websites.
- WebArena (Jul 2023) 📄 1.1k ⭐ 1.4k
Paper · Github · Website
Evaluates agents on realistic web tasks spanning e-commerce, forums, collaborative software development, and content management.
- 🆕 WebArena-Infinity (Mar 2026) ⭐ 41
Github · Website
Generates realistic browser environments with verifiable tasks at scale using a multi-agent pipeline that builds, audits, and hardens web applications from static artifacts such as user manuals and recorded workflows. The initial release includes 10 environments, 1,260 verifiable tasks, and 2,070 successful browser-agent trajectories.
- WebArena-Verified ⭐ 36
Github · Website
A curated, version-controlled set of 812 web agent tasks with deterministic evaluators supporting offline evaluation via network trace replay, eliminating LLM-based judging.
- WebChoreArena (Jun 2025) 📄 19 ⭐ 34
Paper · Github · Website
Evaluates web browsing agents on 532 tedious real-world tasks focusing on massive memory requirements, mathematical calculations, and long-term memory across multiple webpages.
- WebShop (Jul 2022) 📄 938 ⭐ 524
Paper · Github · Website
Evaluates agents on navigating and shopping in a simulated e-commerce environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions.