
Awesome Agent Benchmarks

Awesome @BerkeleyRDI Discord

A curated list of agent benchmarks and evaluation frameworks.

Table of Contents

  • Coding
  • Computer Use
  • Deep Research
  • Embodied
  • General Capabilities
  • Memory
  • Safety
  • Scientific Research
  • Security
  • Web

Coding

  • Commit0 (Dec 2024) 📄 30 ⭐ 190
    Paper · Github · Website

    Evaluates AI agents on writing complete software libraries from scratch given specification documents and interactive unit test suites.

  • LoCoBench-Agent (Nov 2025) 📄 7 ⭐ 19
    Paper · Github

    Evaluates LLM agents in realistic long-context software engineering workflows, testing multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across context lengths from 10K to 1M tokens.

  • MLE-bench (Oct 2024) 📄 200 ⭐ 1.5k
    Paper · Github · Website

    Measures agents on machine learning engineering using 75 competitions from Kaggle, testing skills including model training, dataset preparation, and experiment execution.

  • 🆕 ProjDevBench (Feb 2026) 📄 4 ⭐ 17
    Paper · Github

    Evaluates coding agents on end-to-end project development across 20 programming problems in 8 categories, assessing system architecture design, functional correctness, and iterative solution refinement.

  • SciCode (Jul 2024) 📄 102 ⭐ 191
    Paper · Github

    Evaluates agents on generating code to solve 80 real scientific research problems decomposed into 338 subproblems across 16 natural science fields including math, physics, chemistry, biology, and materials science.

  • SWE-bench (Oct 2023) 📄 2.0k ⭐ 4.8k
    Paper · Github · Website

    Evaluates agents' ability to resolve real GitHub issues by editing codebases, drawn from 2,294 issues across 12 Python repositories; a minimal loading sketch is shown at the end of this section.

  • SWE-EVO (Dec 2025) 📄 10
    Paper

    Evaluates coding agents on 48 long-horizon software evolution tasks requiring multi-step modifications spanning an average of 21 files, drawn from version histories of 7 mature open-source Python projects.

  • SWE-rebench (May 2025) 📄 52
    Paper · Website

    A continuously refreshed benchmark of 21,000+ interactive Python-based software engineering tasks extracted from GitHub repositories, designed to prevent evaluation contamination.

  • USACO (Apr 2024) 📄 67 ⭐ 124
    Paper · Github · Website

    Evaluates agents on 307 competitive programming problems from the USA Computing Olympiad, requiring complex algorithmic reasoning, puzzle solving, and efficient code generation.
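
  Most of these coding benchmarks ship as public datasets or evaluation harnesses. As a rough, non-authoritative illustration, the sketch below loads the SWE-bench task instances through the Hugging Face datasets library. The dataset ID follows the public princeton-nlp release; the field names are assumptions and should be checked against the benchmark's own repository before use.

      # Minimal sketch: browse SWE-bench task instances.
      # Assumes the public Hugging Face release "princeton-nlp/SWE-bench";
      # the field names below are assumed and should be verified.
      from datasets import load_dataset

      swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
      print(f"{len(swe_bench)} task instances in the test split")

      example = swe_bench[0]
      print(example["repo"])                     # source repository for the issue
      print(example["base_commit"])              # commit the agent starts from
      print(example["problem_statement"][:200])  # GitHub issue text to resolve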

Computer Use

  • AndroidWorld (May 2024) 📄 261 ⭐ 744
    Paper · Github

    A dynamic benchmarking environment for autonomous agents on 116 programmatic tasks across 20 real-world Android apps, with tasks parameterized and expressed in natural language.

  • CRAB (Jul 2024) 📄 34 ⭐ 416
    Paper · Github · Website

    Evaluates multimodal agents on 120 tasks across computer desktop and mobile phone environments using a graph-based fine-grained evaluation method.

  • MobileWorld (Dec 2025) 📄 17
    Paper

    Evaluates autonomous mobile agents on 201 tasks across 20 applications in agent-user interactive and MCP-augmented environments, emphasizing long-horizon cross-application workflows.

  • OmniACT (Feb 2024) 📄 136
    Paper · Dataset

    Evaluates agents on generating executable programs to complete computer tasks across desktop and web applications, using both text and visual UI understanding.

  • OSUniverse (May 2025) 📄 7 ⭐ 24
    Paper · Github · Website

    Evaluates agents on complex multimodal desktop tasks of increasing difficulty, from basic UI interactions to multistep workflows requiring spatial reasoning and dexterity.

  • OSWorld (Apr 2024) 📄 593 ⭐ 2.8k
    Paper · Github · Website

    Evaluates multimodal agents on 369 real-world computer tasks across OS, web, and desktop apps.

  • OSWorld-MCP (Oct 2025) 📄 8 ⭐ 222
    Paper · Github · Website

    Evaluates computer-use agents on 158 tasks combining MCP tool invocation and GUI operations across 7 common applications in a real-world environment.

  • SCUBA (Sep 2025) 📄 4 ⭐ 8
    Paper · Github · Website

    Evaluates computer-use agents on CRM workflows within Salesforce across 300 tasks spanning three personas - platform administrators, sales representatives, and service agents.

  • Spider2-V (Jul 2024) 📄 50 ⭐ 151
    Paper · Github · Website

    Evaluates multimodal agents on 494 real-world data science and engineering tasks across 20 enterprise applications, requiring SQL queries, Python code, and GUI operations.

  • Terminal-Bench 1.0 ⭐ 2.1k
    Github · Website

    Evaluates agents' ability to use a computer terminal on ~100 tasks.

  • Terminal-Bench 2.0 (Jan 2026) 📄 67 ⭐ 207
    Paper · Github · Website · Agentbeats

    Curated hard benchmark of 89 tasks in computer terminal environments, spanning software engineering, ML, security, and data science.

  • UI-CUBE (Nov 2025) 📄 2
    Paper · Github · Website

    Evaluates enterprise computer use agents on 226 tasks spanning simple UI interactions and complex workflows in enterprise applications, with multi-resolution testing and automated validation.

  • WorkArena (Mar 2024) 📄 179 ⭐ 245
    Paper · Github · Website

    Evaluates agents on 33 knowledge worker tasks using the ServiceNow enterprise platform, measuring ability to handle realistic workplace software workflows.

  • WorkArena++ (Jul 2024) 📄 45 ⭐ 245
    Paper · Github · Website

    Evaluates LLMs and vision-language models on 682 realistic enterprise workflow tasks requiring planning, problem-solving, logical reasoning, retrieval, and contextual understanding within ServiceNow.

Deep Research

  • BrowseComp-Plus (Aug 2025) 📄 86 ⭐ 255
    Paper · Github · Website

    Evaluates deep-research agents using a fixed curated document corpus, enabling controlled experiments on retrieval methods, citation accuracy, and context engineering.

  • DeepResearch Bench (Jun 2025) 📄 124 ⭐ 697
    Paper · Github · Website

    Evaluates deep research agents on 100 PhD-level research tasks across 22 fields crafted by domain experts, assessing report quality and information retrieval effectiveness.

  • DeepScholar-Bench (Aug 2025) 📄 18 ⭐ 237
    Paper · Github · Website

    A live benchmark for generative research synthesis evaluating systems that retrieve from the live web and produce long-form cited reports, assessed on knowledge synthesis, retrieval quality, and verifiability.

  • MMDeepResearch-Bench (Jan 2026) 📄 6 ⭐ 27
    Paper · Github · Website

    Evaluates multimodal deep research agents on 140 expert-crafted tasks across 21 domains, assessing citation-rich report synthesis where models must connect visual artifacts to sourced claims.

  • PaperArena (Oct 2025) 📄 6 ⭐ 15
    Paper · Github · Website

    Evaluates LLM-based agents on questions requiring multi-paper information integration using external tools, with agents achieving only 38.78% average accuracy.

  • WideSearch (Aug 2025) 📄 34 ⭐ 132
    Paper · Github · Website

    Evaluates agent reliability on large-scale information collection through 200 manually curated questions across 15+ domains requiring agents to gather and organize atomic verifiable information.

Embodied

  • ALFWorld (Oct 2020) 📄 791 ⭐ 724
    Paper · Github · Website

    A benchmark and simulator bridging abstract text-based planning with embodied visual task execution, enabling agents to learn policies in TextWorld and transfer them to ALFRED visual environments.

  • BEHAVIOR-1K (Mar 2024) 📄 119 ⭐ 1.4k
    Paper · Github · Website

    A comprehensive simulation benchmark for human-centered robotics agents covering 1,000 everyday activities across 50 realistic scenes with over 9,000 annotated objects.

  • MineAnyBuild (May 2025) 📄 1 ⭐ 14
    Paper · Github

    Evaluates agents on spatial planning in Minecraft across 4,000 curated tasks covering spatial understanding, reasoning, creativity, and commonsense, requiring generation of executable architecture building plans.

  • Robotouille (Feb 2025) 📄 25 ⭐ 40
    Paper · Github · Website

    Evaluates LLM agents on long-horizon asynchronous kitchen planning scenarios, testing their capacity to manage overlapping tasks, interruptions, and parallel execution.

General Capabilities

  • AgencyBench (Jan 2026) 📄 7 ⭐ 77
    Paper · Github · Website

    Evaluates autonomous agents on 6 core agentic capabilities across 32 real-world scenarios comprising 138 tasks, each requiring approximately 90 tool calls and 1 million tokens.

  • API-Bank (Apr 2023) 📄 400
    Paper · Github

    Evaluates LLMs on planning, retrieving, and calling APIs across 73 tools and 314 annotated tool-use dialogues containing 753 API calls.

  • AppWorld (Jul 2024) 📄 144 ⭐ 406
    Paper · Github · Website

    Evaluates autonomous agents on 750 tasks requiring rich interactive code generation across 9 everyday applications accessible through 457 APIs, testing multi-app coordination and avoidance of unintended side effects.

  • CRMArena (Nov 2024) 📄 43 ⭐ 136
    Paper · Github

    Evaluates AI agents on realistic customer service tasks across 9 task types and 3 personas using 16 industrial CRM objects with interconnected data simulating enterprise environments.

  • DeepPlanning (Jan 2026) 📄 9
    Paper · Website

    Evaluates LLM agents on long-horizon planning through multi-day travel planning and shopping tasks requiring proactive information acquisition, local constrained reasoning, and global constrained optimization.

  • GAIA (Nov 2023) 📄 685
    Paper · Website

    Evaluates general AI assistants on 466 real-world questions requiring reasoning, multi-modality handling, web browsing, and tool-use proficiency, with humans scoring 92% vs. 15% for GPT-4 with plugins.

  • Gaia2 (Sep 2025) 📄 16 ⭐ 479
    Paper · Github · Website

    Measures general agent capabilities in dynamic, asynchronous environments requiring agents to handle ambiguities, adapt to noise, collaborate with other agents, and operate under temporal constraints.

  • GTA (Jul 2024) 📄 69 ⭐ 144
    Paper · Github · Website

    Evaluates LLMs on real-world tool use with 229 tasks using human-written queries, real deployed tools, and authentic multimodal inputs such as images, screenshots, and code files.

  • HCAST (Mar 2025) 📄 9 ⭐ 20
    Paper · Github

    Evaluates AI agents on 189 tasks across ML engineering, cybersecurity, software engineering, and general reasoning, using human baselines collected over 1,500+ hours to calibrate difficulty.

  • MCP-Bench (Aug 2025) 📄 47 ⭐ 475
    Paper · Github · Website

    Evaluates LLMs on realistic multi-step tasks requiring tool use and cross-tool coordination through 28 MCP servers spanning 250 tools across finance, travel, scientific computing, and academic search.

  • MCP-Universe (Aug 2025) 📄 64 ⭐ 583
    Paper · Github · Website

    Evaluates LLMs on realistic tasks through 11 real-world MCP servers across 6 domains including location navigation, repository management, financial analysis, 3D design, browser automation, and web search.

  • REALM-Bench (Feb 2025) 📄 7 ⭐ 38
    Paper · Github

    Evaluates multi-agent systems on 14 progressively complex real-world planning and scheduling problems featuring multi-agent coordination, inter-dependencies, and dynamic disruptions.

  • tau-bench (Jun 2024) 📄 483 ⭐ 1.2k
    Paper · Github

    Evaluates agents on dynamic conversations with a simulated user in real-world domains, providing domain-specific API tools and policy guidelines across retail and airline scenarios.

  • Tool Decathlon (Oct 2025) 📄 28 ⭐ 336
    Paper · Github · Website

    Evaluates language agents across 108 tasks spanning 32 software applications and 604 tools, requiring multi-app workflows over approximately 20 interaction turns.

  • ToolComp (Jan 2025) 📄 7 ⭐ 4
    Paper · Github

    Evaluates multi-step tool-use reasoning with process supervision labels, enabling assessment of both final outcomes and intermediate reasoning steps.

  • UltraHorizon (Sep 2025) 📄 11 ⭐ 22
    Paper · Github

    Measures foundational agent capabilities for long-horizon partially observable tasks, with standard configurations involving 35K+ tokens and 60+ tool calls across exploration environments.

  • τ²-bench (Jun 2025) 📄 192 ⭐ 1.1k
    Paper · Github · Website · Agentbeats

    Evaluates agents in dual-control settings where both the agent and user modify shared state, exposing coordination and communication failures.

Memory

  • Evo-Memory (Nov 2025) 📄 44
    Paper

    A streaming benchmark for evaluating self-evolving memory in LLM agents across 10 diverse datasets, requiring agents to search, adapt, and update memory after each interaction.

  • LoCoMo (Feb 2024) 📄 379 ⭐ 805
    Paper · Github · Website

    Benchmarks long-term memory in language models across question answering, event summarization, and multimodal dialogue tasks using conversations spanning up to 35 sessions and 9K tokens on average.

  • 🆕 LoCoMo-Plus (Feb 2026) 📄 1 ⭐ 20
    Paper · Github

    Benchmarks cognitive memory in LLM agents under cue-trigger semantic disconnect, testing agents on retaining and applying implicit constraints across long conversational contexts.

  • LongMemEval (Oct 2024) 📄 225 ⭐ 720
    Paper · Github · Website

    Benchmarks chat assistants on five long-term memory abilities - information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention - using 500 questions embedded in scalable chat histories.

  • Mem2ActBench (Jan 2026) 📄 2
    Paper

    Evaluates whether agents can proactively leverage long-term memory to execute tool-based actions, covering 2,029 sessions across 400 tool-use tasks.

  • 🆕 MemoryArena (Feb 2026) 📄 9
    Paper · Website

    Evaluates agent memory in multi-session loops across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.

  • MEMTRACK (Oct 2025) 📄 6
    Paper

    Evaluates long-term memory and state tracking in multi-platform agent environments, testing memory acquisition, selection, and conflict resolution across realistic organizational workflows spanning Slack, Linear, and Git.

Safety

  • Agent-SafetyBench (Dec 2024) 📄 149 ⭐ 132
    Paper · Github · Dataset

    Evaluates the safety of LLM agents across 8 categories of safety risks and 10 failure modes, covering 349 interaction environments and 2,000 test cases.

  • AgentAuditor (May 2025) 📄 31 ⭐ 3
    Paper · Github

    Benchmarks LLM-based evaluators on detecting safety risks and security threats in agent interactions, covering 2,293 annotated records across 15 risk types and 29 application scenarios.

  • MCP-SafetyBench (Dec 2025) 📄 14 ⭐ 17
    Paper · Github · Website

    Evaluates safety of LLMs using the Model Context Protocol across 5 domains with a taxonomy of 20 MCP attack types spanning server, host, and user sides.

  • 🆕 MT-AgentRisk (Feb 2026) 📄 2 ⭐ 17
    Paper · Github · Website

    Evaluates multi-turn tool-using agent safety, revealing that attack success rates increase by approximately 16% in multi-turn settings compared to single-turn baselines.

  • OpenAgentSafety (Jul 2025) 📄 30 ⭐ 31
    Paper · Github · Dataset

    Evaluates agent behavior across 8 critical risk categories using real tools including web browsers, code execution, file systems, and bash shells across 350+ multi-turn tasks with both benign and adversarial intents.

  • OS-Harm (Jun 2025) 📄 43 ⭐ 64
    Paper · Github

    Evaluates safety of computer use agents on 150 tasks spanning three harm categories - deliberate user misuse, prompt injection attacks, and model misbehavior.

Scientific Research

  • AstaBench (Oct 2025) 📄 8 ⭐ 101
    Paper · Github

    Evaluates agents on 2,400+ problems spanning the full scientific discovery process across multiple domains, including literature review, experiment replication, data analysis, and research direction proposals.

  • InnovatorBench (Oct 2025) 📄 7 ⭐ 17
    Paper · Github · Website

    Evaluates agents on 20 code-driven LLM research tasks spanning data construction, filtering, augmentation, and model design, paired with the ResearchGym execution environment.

  • RE-Bench (Nov 2024) 📄 79 ⭐ 136
    Paper · Github

    Evaluates AI agents on 7 challenging open-ended ML research engineering tasks, benchmarking against human experts across 1,500+ hours of collected human baselines.

Security

  • Agent Security Bench (Oct 2024) 📄 182 ⭐ 227
    Paper · Github · Website

    Evaluates attacks and defenses of LLM-based agents across 10 scenarios, 10 agent types, over 400 tools, and 27 attack and defense methods.

  • CAIBench (Oct 2025) 📄 12
    Paper · Github

    A modular meta-benchmark for evaluating LLMs and agents across offensive and defensive cybersecurity domains, integrating over 10,000 instances across Jeopardy CTFs, Attack-Defense CTFs, Cyber Range exercises, and knowledge assessments.

  • CVE-Bench (Mar 2025) 📄 59 ⭐ 207
    Paper · Github

    Evaluates LLM agents on exploiting real-world web application vulnerabilities based on critical-severity CVEs, with state-of-the-art agents resolving approximately 13% of vulnerabilities.

  • CyberGym (Jun 2025) 📄 11 ⭐ 276
    Paper · Github · Website

    Evaluates agents on real-world vulnerability analysis by tasking them with generating proof-of-concept tests for known vulnerabilities given a natural language description and codebase. Covers 1,507 vulnerabilities across 188 open-source projects.

  • DoomArena (Apr 2025) 📄 22 ⭐ 57
    Paper · Github

    A security evaluation framework for AI agents that tests vulnerabilities across configurable threat models including malicious user and malicious environment scenarios, integrated with BrowserGym and tau-bench.

  • ExCyTIn-Bench (Jul 2025) 📄 8 ⭐ 121
    Paper · Github · Website

    Evaluates LLM agents on cyber threat investigation through 589 security questions derived from investigation graphs built from 8 simulated attacks across 57 log tables in an Azure environment.

  • SEC-bench (Jun 2025) 📄 26 ⭐ 70
    Paper · Github

    Evaluates LLM agents on authentic security engineering tasks - proof-of-concept generation and vulnerability patching - using an automated pipeline that constructs repositories, reproduces vulnerabilities, and generates gold patches.

  • WASP (Apr 2025) 📄 73 ⭐ 83
    Paper · Github

    Evaluates end-to-end security of web agents against prompt injection attacks in realistic scenarios, with attacks partially succeeding in up to 86% of cases.

Web

  • AssistantBench (Jul 2024) 📄 105 ⭐ 70
    Paper · Github · Website

    Evaluates web agents on 214 realistic time-consuming tasks that can be automatically evaluated across different scenarios and domains, with no model exceeding 26% accuracy.

  • BrowseComp (Apr 2025) 📄 361 ⭐ 4.5k
    Paper · Github

    Evaluates web browsing agents on 1,266 questions requiring persistent navigation to find entangled hard-to-locate information with short, verifiable answers.

  • BrowserArena (Oct 2025) 📄 8 ⭐ 4
    Paper · Github

    A live open-web agent evaluation platform that collects user-submitted tasks and runs Arena-style head-to-head comparisons with step-level human feedback for LLM web agents.

  • Mind2Web (Jun 2023) 📄 988 ⭐ 984
    Paper · Github · Website

    Evaluates generalist web agents on 2,000+ tasks across 137 websites and 31 domains.

  • Mind2Web 2 (Jun 2025) 📄 34 ⭐ 109
    Paper · Github · Website

    Evaluates agentic search systems on 130 long-horizon tasks requiring real-time web browsing, information synthesis, and citation-backed answers, using an Agent-as-a-Judge framework.

  • Online-Mind2Web (Apr 2025) 📄 90 ⭐ 171
    Paper · Github · Leaderboard

    Evaluates web agents on 300 diverse realistic tasks spanning 136 websites under conditions that approximate real user settings.

  • VisualWebArena (Jan 2024) ⭐ 464
    Paper · Github · Website

    Evaluates multimodal web agents on realistic visually grounded web tasks, requiring agents to process image-text inputs and execute actions on websites.

  • WebArena (Jul 2023) 📄 1.1k ⭐ 1.4k
    Paper · Github · Website

    Evaluates agents on realistic web tasks spanning e-commerce, forums, collaborative software dev, and content management.

  • 🆕 WebArena-Infinity (Mar 2026) ⭐ 41
    Github · Website

    Generates realistic browser environments with verifiable tasks at scale using a multi-agent pipeline that builds, audits, and hardens web applications from static artifacts such as user manuals and recorded workflows. The initial release includes 10 environments, 1,260 verifiable tasks, and 2,070 successful browser-agent trajectories.

  • WebArena-Verified ⭐ 36
    Github · Website

    A curated, version-controlled set of 812 web agent tasks with deterministic evaluators supporting offline evaluation via network trace replay, eliminating LLM-based judging.

  • WebChoreArena (Jun 2025) 📄 19 ⭐ 34
    Paper · Github · Website

    Evaluates web browsing agents on 532 tedious real-world tasks focusing on massive memory requirements, mathematical calculations, and long-term memory across multiple webpages.

  • WebShop (Jul 2022) 📄 938 ⭐ 524
    Paper · Github · Website

    Evaluates agents on navigating and shopping in a simulated e-commerce environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions.
