Add webtailbench to external benchmarks by miguelg719 · Pull Request #1688 · browserbase/stagehand

miguelg719 · 2026-02-16T05:52:27Z

why

what changed

test plan

Summary by cubic

Adds WebTailBench as an external agent benchmark with dataset, task, suite fan-out, and CLI support. Also enables a navigate tool in Anthropic CUA that maps to goto and returns a post-navigation screenshot, and updates the evaluator prompt to pass tasks completed up to (but not including) purchase/booking steps.

New Features
- Added dataset: packages/evals/datasets/webtailbench/WebTailBench_data.jsonl.
- Implemented agent task (agent/webtailbench): starts at Google, collects up to 8 screenshots, resizes them, evaluates with V3Evaluator using a prompt that treats payment/booking as out of scope, and cleans up listeners.
- Suite + index wiring: fans out per model, respects datasetFilter=webtailbench, and sets task_category metadata.
- Config + CLI: task registered under external_agent_benchmarks (limit 25); CLI maps --benchmark webtailbench to agent/webtailbench.
- Anthropic CUA: added navigate tool schema; maps to goto; captures a screenshot after navigation and returns it with the current URL text when available.

Migration
- Run via --benchmark webtailbench or include agent/webtailbench; optionally set datasetFilter=webtailbench.
- Control cases with EVAL_MAX_K or EVAL_WEBTAILBENCH_LIMIT (default 25) and EVAL_WEBTAILBENCH_SAMPLE; cap steps with AGENT_EVAL_MAX_STEPS.
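
The limit and sampling controls listed above could be resolved along these lines. This is a minimal sketch, not the PR's code: `resolveLimit` and `pickCases` are hypothetical names, the precedence of EVAL_MAX_K over EVAL_WEBTAILBENCH_LIMIT is an assumption, and so is the reading of EVAL_WEBTAILBENCH_SAMPLE as a number of randomly chosen rows.

```typescript
// Hypothetical helpers. The env var names come from the PR summary; the
// precedence and parsing rules below are assumptions.
type Env = Record<string, string | undefined>;

function resolveLimit(env: Env, fallback = 25): number {
  const raw = env.EVAL_MAX_K ?? env.EVAL_WEBTAILBENCH_LIMIT;
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  // Use the default when the value is missing or not a positive integer.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Assumption: with a sample size, rows are shuffled first and then capped
// by both the sample size and the limit.
function pickCases<T>(
  rows: T[],
  limit: number,
  sample?: number,
  rng: () => number = Math.random,
): T[] {
  if (sample === undefined) return rows.slice(0, limit);
  const shuffled = [...rows];
  // Fisher-Yates shuffle driven by the injected random source.
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = tmp;
  }
  return shuffled.slice(0, Math.min(sample, limit));
}
```

Injecting `rng` keeps sampling reproducible in tests; the actual suite may seed or sample differently.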

Written for commit 0a87f74. Summary will update on new commits.

changeset-bot · 2026-02-16T05:52:31Z

⚠️ No Changeset found

Latest commit: 0a87f74

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

cubic-dev-ai

No issues found across 8 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant CLI as CLI / Runner
    participant Suite as WebTailBench Suite
    participant Task as WebTailBench Task
    participant Agent as Stagehand V3 Agent
    participant Bus as V3 Event Bus
    participant Eval as V3 Evaluator

    Note over CLI, Eval: NEW: WebTailBench Evaluation Flow

    CLI->>Suite: buildWebTailBenchTestcases(models)
    Suite->>Suite: NEW: Read WebTailBench_data.jsonl
    Suite->>Suite: NEW: Fan out test cases (limit 25)
    Suite-->>CLI: Array of Testcases

    loop For each Testcase
        CLI->>Task: execute(v3, input, modelName)
        
        Task->>Task: CHANGED: Navigate to Google (startUrl)
        
        Task->>Agent: NEW: init Agent(cua: true)
        Note right of Agent: NEW: Includes 'navigate' tool<br/>mapping to goto + screenshot
        
        Task->>Bus: NEW: Subscribe to agent_screenshot_taken_event
        
        Task->>Agent: execute(instruction)
        
        loop Agent Steps
            Agent->>Agent: Reasoning & Tool Use
            opt Tool: navigate
                Agent->>Agent: NEW: Capture post-navigation screenshot
            end
            Agent-->>Bus: NEW: Emit screenshot buffer
            Bus-->>Task: NEW: ScreenshotCollector captures buffer
        end
        
        Agent-->>Task: agentResult (message/reasoning)
        
        Task->>Task: Stop collection & resize screenshots (0.7x)
        
        Task->>Eval: ask(question, screenshots, reasoning)
        Note right of Eval: Evaluation prompt ignores<br/>payment/booking steps
        
        Eval-->>Task: evaluation (YES/NO) + reasoning
        Task-->>CLI: success status & logs
    end

    Note over Task, Bus: Cleanup: Unsubscribe from Event Bus & clear buffers
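
The collect-then-clean-up part of the flow above can be sketched as a bounded collector that subscribes to the bus, keeps at most 8 screenshots, and unsubscribes when stopped. Everything here is illustrative: `ScreenshotBus` and `BoundedScreenshotCollector` are hypothetical stand-ins for the V3 event bus and the task's collector, and dropping the oldest screenshot when the cap is hit is an assumption. Only the event name and the cap of 8 come from the PR description.

```typescript
// Hypothetical stand-in for the V3 event bus.
type ScreenshotListener = (image: Uint8Array) => void;

class ScreenshotBus {
  private listeners = new Map<string, Set<ScreenshotListener>>();

  on(event: string, fn: ScreenshotListener): void {
    const set = this.listeners.get(event) ?? new Set<ScreenshotListener>();
    set.add(fn);
    this.listeners.set(event, set);
  }

  off(event: string, fn: ScreenshotListener): void {
    this.listeners.get(event)?.delete(fn);
  }

  emit(event: string, image: Uint8Array): void {
    this.listeners.get(event)?.forEach((fn) => fn(image));
  }

  listenerCount(event: string): number {
    return this.listeners.get(event)?.size ?? 0;
  }
}

class BoundedScreenshotCollector {
  private shots: Uint8Array[] = [];
  private readonly bus: ScreenshotBus;
  private readonly event: string;
  private readonly max: number;
  private readonly handler: ScreenshotListener;

  constructor(bus: ScreenshotBus, event: string, max = 8) {
    this.bus = bus;
    this.event = event;
    this.max = max;
    this.handler = (image) => {
      this.shots.push(image);
      // Assumption: when over the cap, the oldest screenshot is dropped.
      if (this.shots.length > this.max) this.shots.shift();
    };
  }

  start(): void {
    this.bus.on(this.event, this.handler);
  }

  // Unsubscribes, returns what was collected, and clears the internal buffer.
  stop(): Uint8Array[] {
    this.bus.off(this.event, this.handler);
    const collected = this.shots;
    this.shots = [];
    return collected;
  }
}
```

In a task, `stop()` would typically sit in a `finally` block so the listener is removed even when the agent run throws.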

greptile-apps · 2026-02-16T19:07:27Z

Greptile Summary

This PR adds WebTailBench as a new external agent benchmark (608 tasks covering flights and shopping) and enables a navigate tool in the Anthropic CUA client that maps to goto with a post-navigation screenshot response.

- WebTailBench benchmark: New JSONL dataset, task file (agent/webtailbench), suite fan-out, and CLI/config wiring, all following the established patterns from webvoyager and onlineMind2Web.
- Anthropic CUA navigate tool: Adds a custom tool schema, converts it to a goto action, and returns a screenshot + URL in the tool result. This enables the CUA model to navigate directly to URLs without needing to type them into the browser address bar.
- Evaluator prompt: The webtailbench evaluator treats purchasing/booking steps as out of scope, marking tasks as passed if all pre-purchase steps were completed.
- The dataset contains trailing \r characters in the ques field values which flow into agent and evaluator prompts, a minor data cleanliness issue worth addressing.
- The navigate tool's URL reporting in tool results may reflect a stale currentUrl since the field is only initialized once, though this is consistent with the existing computer tool behavior.

Confidence Score: 4/5

- This PR is safe to merge; it adds a new benchmark and a navigate tool following established patterns with no breaking changes to existing functionality.
- The eval-side changes closely follow existing benchmark patterns (webvoyager, onlineMind2Web). The CUA navigate tool is a focused addition. Minor concerns: stale URL in navigate tool results and trailing \r characters in dataset, neither of which block merging.
- packages/core/lib/v3/agent/AnthropicCUAClient.ts (navigate tool URL accuracy), packages/evals/datasets/webtailbench/WebTailBench_data.jsonl (data quality: trailing \r in ques field)

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/core/lib/v3/agent/AnthropicCUAClient.ts | Adds a `navigate` custom tool (schema + action handling + tool result) that maps to `goto`. The screenshot and URL in tool results may use a stale `currentUrl` since it's only set once during initialization, though this is consistent with existing `computer` tool behavior. |
| packages/evals/tasks/agent/webtailbench.ts | New eval task for WebTailBench that follows established benchmark patterns (webvoyager, onlineMind2Web). Includes proper cleanup in finally block, event-based screenshot collection, and image resizing on buffers per project conventions. |
| packages/evals/suites/webtailbench.ts | New suite file that fans out JSONL dataset rows into per-model testcases. Follows the same patterns as other benchmark suites (gaia, webvoyager, onlineMind2Web) with env-driven limits and sampling. |
| packages/evals/index.eval.ts | Adds WebTailBench fan-out handling that mirrors the existing GAIA, WebVoyager, and Mind2Web patterns. Consistent dataset filter integration. |
| packages/evals/datasets/webtailbench/WebTailBench_data.jsonl | 608-line JSONL dataset with flight and shopping tasks. Contains trailing `\r` characters in `ques` field values that get passed directly to LLM prompts. |
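
The suite's fan-out of JSONL rows into per-model testcases might look roughly like the sketch below. The row fields (`id`, `category`, `ques`, `web`) match the dataset line shown in the diff; the `Testcase` shape, the function names, and the use of the row category as `task_category` are assumptions made for illustration.

```typescript
interface WebTailBenchRow {
  id: string;
  category: string;
  ques: string;
  web: string;
}

// Hypothetical testcase shape; the real suite type may differ.
interface Testcase {
  name: string;
  modelName: string;
  input: WebTailBenchRow;
  metadata: { task_category: string };
}

// Parses JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): WebTailBenchRow[] {
  const rows: WebTailBenchRow[] = [];
  for (const line of text.split("\n")) {
    if (line.trim().length === 0) continue;
    rows.push(JSON.parse(line) as WebTailBenchRow);
  }
  return rows;
}

// Produces one testcase per (model, row) pair.
function fanOut(rows: WebTailBenchRow[], models: string[]): Testcase[] {
  const cases: Testcase[] = [];
  for (const modelName of models) {
    for (const row of rows) {
      cases.push({
        name: "agent/webtailbench",
        modelName,
        input: row,
        // Assumption: the dataset category is reused as task_category.
        metadata: { task_category: row.category },
      });
    }
  }
  return cases;
}
```

With 25 rows and two models this yields 50 testcases, which is why the per-benchmark limit matters for run time.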

Sequence Diagram

sequenceDiagram
    participant LLM as Anthropic CUA Model
    participant Client as AnthropicCUAClient
    participant Handler as v3CuaAgentHandler
    participant Page as Browser Page

    LLM->>Client: tool_use: navigate(url)
    Client->>Client: convertToolUseToAction → {type: "goto", url}
    Client->>Handler: actionHandler({type: "goto", url})
    Handler->>Page: page.goto(url, {waitUntil: "load"})
    Page-->>Handler: Navigation complete
    Handler->>Page: page.screenshot()
    Page-->>Handler: Screenshot buffer
    Handler-->>Client: Action complete
    Client->>Client: takeAction → captureScreenshot()
    Client->>Handler: screenshotProvider()
    Handler->>Page: page.screenshot()
    Page-->>Handler: Screenshot buffer (base64)
    Handler-->>Client: base64 image
    Client-->>LLM: tool_result: [image + "Navigated to: URL"]
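
The first two steps of the diagram (a `navigate` tool call converted into a `goto` action) can be sketched as follows. The schema text, the `ToolUse` and `GotoAction` types, and the `toGotoAction` name are assumptions for illustration and are not copied from AnthropicCUAClient.ts; the schema object follows the general shape of an Anthropic custom tool definition (`name`, `description`, `input_schema`).

```typescript
// Illustrative tool definition; the PR's actual description text is not shown.
const navigateTool = {
  name: "navigate",
  description: "Navigate the browser directly to a URL.",
  input_schema: {
    type: "object",
    properties: {
      url: { type: "string", description: "Absolute URL to open" },
    },
    required: ["url"],
  },
};

// Hypothetical minimal types for a tool call and the resulting action.
interface ToolUse {
  name: string;
  input: Record<string, unknown>;
}

interface GotoAction {
  type: "goto";
  url: string;
}

// Returns a goto action for a valid navigate call, or null otherwise.
function toGotoAction(item: ToolUse): GotoAction | null {
  if (item.name !== "navigate") return null;
  const url = item.input.url;
  if (typeof url !== "string" || url.trim() === "") return null;
  return { type: "goto", url: url.trim() };
}
```

Returning `null` for malformed input leaves the caller free to report an error back to the model instead of navigating to an empty URL.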

Last reviewed commit: 0a87f74

greptile-apps

8 files reviewed, 2 comments

greptile-apps · 2026-02-16T19:07:31Z

packages/evals/datasets/webtailbench/WebTailBench_data.jsonl

@@ -0,0 +1,609 @@
+{"id":"united_13","category":"flights","ques":"What is the price difference between economy and business class on United Airlines direct flights from Chicago to São Paulo from 11/24/2025 to 12/14/2025? If there are no available flights for those dates, please indicate that in your answer\r","web":""}


Trailing \r in ques field values

Nearly all entries in this JSONL file have a literal \r (carriage return) embedded at the end of the ques string. For example, the first line's ques value ends with ...your answer\r". When parsed, this \r will be included in the string passed to both the agent instruction (agent.execute({ instruction: params.ques })) and the evaluator prompt. While this is unlikely to cause a hard failure, it introduces an invisible trailing character in every LLM prompt which could subtly affect tokenization and response behavior. Consider stripping \r from the dataset or trimming the ques values at parse time.
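
Trimming at parse time, the second option suggested above, could be as small as the sketch below. `parseRow` and `cleanQuestion` are hypothetical names; the row fields match the dataset line shown in the diff.

```typescript
interface DatasetRow {
  id: string;
  category: string;
  ques: string;
  web: string;
}

// Removes carriage returns and surrounding whitespace from a question.
function cleanQuestion(ques: string): string {
  return ques.replace(/\r/g, "").trim();
}

// Parses one JSONL line and normalizes the ques field before it can reach
// the agent instruction or the evaluator prompt.
function parseRow(line: string): DatasetRow {
  const row = JSON.parse(line) as DatasetRow;
  return { ...row, ques: cleanQuestion(row.ques) };
}
```

Cleaning at parse time leaves the dataset file untouched; stripping the characters from the JSONL itself would fix the problem at the source instead.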

greptile-apps · 2026-02-16T19:07:32Z

packages/core/lib/v3/agent/AnthropicCUAClient.ts

+        } else if (item.name === "navigate") {
+          // For navigate tool, capture screenshot after navigation and return image
+          const screenshot = await this.captureScreenshot();


Navigate tool screenshot captured before navigation settles

The screenshot here is captured immediately after the action handler finishes page.goto() (which waits for "load"). However, there is no explicit wait for the page URL to be updated on this.currentUrl after the goto action. If setCurrentUrl is called asynchronously (e.g., via a page URL change listener), the this.currentUrl check on line 656 may still reflect the old URL rather than the navigated one.

For the computer tool (line 591), this is fine because the URL was already current. But for navigate, the value of this.currentUrl at screenshot time depends on the timing of URL updates. This may cause the tool result to say "Navigated to: [old URL]" instead of the actual destination.
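
One way to avoid reporting a stale URL is to read it from a provider at the moment the tool result is built, rather than from a cached field. The sketch below assumes this approach; `buildNavigateResult`, `NavigateResultBlock`, and the `getCurrentUrl` callback are hypothetical, and the block shape is simplified compared with a real Anthropic tool result.

```typescript
// Simplified result blocks; a real tool result wraps the image differently.
type NavigateResultBlock =
  | { type: "image"; base64: string }
  | { type: "text"; text: string };

// Builds the navigate tool result, querying the URL at call time.
function buildNavigateResult(
  screenshotBase64: string,
  getCurrentUrl: () => string | undefined,
): NavigateResultBlock[] {
  const blocks: NavigateResultBlock[] = [
    { type: "image", base64: screenshotBase64 },
  ];
  // Read the URL now, after navigation, instead of using a cached value.
  const url = getCurrentUrl();
  if (url) {
    blocks.push({ type: "text", text: `Navigated to: ${url}` });
  }
  return blocks;
}
```

In the client, the callback would wrap whatever call returns the page's live URL; whether that is cheap enough to call on every tool result is a separate design question.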

Add webtailbench to external benchmarks

ece914c

miguelg719 added 4 commits February 15, 2026 22:05

update screenshot count

c848119

update cli

32e9c37

enable goto on anthropic cua

89261d6

update evaluator prompt for webtailbench

0a87f74

miguelg719 marked this pull request as ready for review February 16, 2026 19:02

cubic-dev-ai bot reviewed Feb 16, 2026

View reviewed changes

greptile-apps bot reviewed Feb 16, 2026

miguelg719 wants to merge 5 commits into main from webtailbench
