
Add webtailbench to external benchmarks#1688

Open
miguelg719 wants to merge 5 commits into main from webtailbench

Conversation

@miguelg719
Collaborator

@miguelg719 miguelg719 commented Feb 16, 2026

why

what changed

test plan


Summary by cubic

Adds WebTailBench as an external agent benchmark with dataset, task, suite fan-out, and CLI support. Also enables a navigate tool in Anthropic CUA that maps to goto and returns a post-navigation screenshot, and updates the evaluator prompt to pass tasks completed up to (but not including) purchase/booking steps.

  • New Features

    • Added dataset: packages/evals/datasets/webtailbench/WebTailBench_data.jsonl.
    • Implemented agent task (agent/webtailbench): starts at Google, collects up to 8 screenshots, resizes them, evaluates with V3Evaluator using a prompt that treats payment/booking as out of scope, and cleans up listeners.
    • Suite + index wiring: fans out per model, respects datasetFilter=webtailbench, and sets task_category metadata.
    • Config + CLI: task registered under external_agent_benchmarks (limit 25); CLI maps --benchmark webtailbench to agent/webtailbench.
    • Anthropic CUA: added navigate tool schema; maps to goto; captures a screenshot after navigation and returns it with the current URL text when available.
  • Migration

    • Run via --benchmark webtailbench or include agent/webtailbench; optionally set datasetFilter=webtailbench.
    • Control cases with EVAL_MAX_K or EVAL_WEBTAILBENCH_LIMIT (default 25) and EVAL_WEBTAILBENCH_SAMPLE; cap steps with AGENT_EVAL_MAX_STEPS.
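The suite fan-out and limit handling described above can be sketched roughly as follows. This is an illustration, not the PR's code: the `Testcase` shape, the `resolveLimit` and `fanOut` helpers, and the use of `external_agent_benchmarks` as the `task_category` value are assumptions; only the row fields, the task name, and the default of 25 come from this page.

```typescript
// Row shape inferred from the dataset line quoted in this PR.
interface WebTailBenchRow {
  id: string;
  category: string;
  ques: string;
  web: string;
}

// Hypothetical testcase shape; the real suite's type may differ.
interface Testcase {
  name: string;
  modelName: string;
  input: WebTailBenchRow;
  metadata: { task_category: string };
}

// Parse a positive integer limit from an env-style string, else use the fallback.
function resolveLimit(raw: string | undefined, fallback: number = 25): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Take the first `limit` rows and emit one testcase per (model, row) pair.
function fanOut(
  rows: WebTailBenchRow[],
  models: string[],
  limit: number,
): Testcase[] {
  const selected = rows.slice(0, limit);
  const testcases: Testcase[] = [];
  for (const modelName of models) {
    for (const row of selected) {
      testcases.push({
        name: "agent/webtailbench",
        modelName,
        input: row,
        // Assumed value, taken from the config category named in the summary.
        metadata: { task_category: "external_agent_benchmarks" },
      });
    }
  }
  return testcases;
}
```

In a Node runner the limit would presumably be read as `resolveLimit(process.env.EVAL_WEBTAILBENCH_LIMIT)`. How `EVAL_MAX_K` and `EVAL_WEBTAILBENCH_SAMPLE` interact with it is not stated here, so the sketch leaves them out.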

Written for commit 0a87f74. Summary will update on new commits.

@changeset-bot

changeset-bot bot commented Feb 16, 2026

⚠️ No Changeset found

Latest commit: 0a87f74

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@miguelg719 miguelg719 marked this pull request as ready for review February 16, 2026 19:02
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 8 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant CLI as CLI / Runner
    participant Suite as WebTailBench Suite
    participant Task as WebTailBench Task
    participant Agent as Stagehand V3 Agent
    participant Bus as V3 Event Bus
    participant Eval as V3 Evaluator

    Note over CLI, Eval: NEW: WebTailBench Evaluation Flow

    CLI->>Suite: buildWebTailBenchTestcases(models)
    Suite->>Suite: NEW: Read WebTailBench_data.jsonl
    Suite->>Suite: NEW: Fan out test cases (limit 25)
    Suite-->>CLI: Array of Testcases

    loop For each Testcase
        CLI->>Task: execute(v3, input, modelName)
        
        Task->>Task: CHANGED: Navigate to Google (startUrl)
        
        Task->>Agent: NEW: init Agent(cua: true)
        Note right of Agent: NEW: Includes 'navigate' tool<br/>mapping to goto + screenshot
        
        Task->>Bus: NEW: Subscribe to agent_screenshot_taken_event
        
        Task->>Agent: execute(instruction)
        
        loop Agent Steps
            Agent->>Agent: Reasoning & Tool Use
            opt Tool: navigate
                Agent->>Agent: NEW: Capture post-navigation screenshot
            end
            Agent-->>Bus: NEW: Emit screenshot buffer
            Bus-->>Task: NEW: ScreenshotCollector captures buffer
        end
        
        Agent-->>Task: agentResult (message/reasoning)
        
        Task->>Task: Stop collection & resize screenshots (0.7x)
        
        Task->>Eval: ask(question, screenshots, reasoning)
        Note right of Eval: Evaluation prompt ignores<br/>payment/booking steps
        
        Eval-->>Task: evaluation (YES/NO) + reasoning
        Task-->>CLI: success status & logs
    end

    Note over Task, Bus: Cleanup: Unsubscribe from Event Bus & clear buffers
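The subscribe, collect, and cleanup steps in the diagram can be sketched as below. The `TinyBus` and `CappedScreenshotCollector` classes are hypothetical stand-ins, not the V3 event bus or the task's real collector. The cap of 8 and the event name come from this page; keeping the most recent frames (rather than the first 8) is an assumption.

```typescript
type Listener = (frame: Uint8Array) => void;

// Minimal stand-in for an event bus; the real bus API is not shown in this PR page.
class TinyBus {
  private listeners = new Map<string, Set<Listener>>();

  on(event: string, listener: Listener): void {
    if (!this.listeners.has(event)) this.listeners.set(event, new Set());
    this.listeners.get(event)!.add(listener);
  }

  off(event: string, listener: Listener): void {
    this.listeners.get(event)?.delete(listener);
  }

  emit(event: string, frame: Uint8Array): void {
    this.listeners.get(event)?.forEach((listener) => listener(frame));
  }
}

// Keeps at most `maxFrames` screenshots, dropping the oldest first (assumed policy).
class CappedScreenshotCollector {
  private frames: Uint8Array[] = [];
  private maxFrames: number;

  constructor(maxFrames: number = 8) {
    this.maxFrames = maxFrames;
  }

  // Arrow property so it can be passed directly as a listener.
  add = (frame: Uint8Array): void => {
    this.frames.push(frame);
    if (this.frames.length > this.maxFrames) this.frames.shift();
  };

  snapshot(): Uint8Array[] {
    return [...this.frames];
  }

  clear(): void {
    this.frames = [];
  }
}

// Subscribe, run the agent step, then always unsubscribe and clear buffers.
function collectDuring(bus: TinyBus, run: () => void): Uint8Array[] {
  const collector = new CappedScreenshotCollector();
  bus.on("agent_screenshot_taken_event", collector.add);
  try {
    run();
    return collector.snapshot();
  } finally {
    bus.off("agent_screenshot_taken_event", collector.add);
    collector.clear();
  }
}
```

Doing the unsubscribe in `finally` mirrors the cleanup note in the diagram: listeners and buffers are released even if the agent run throws.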

@greptile-apps
Contributor

greptile-apps bot commented Feb 16, 2026

Greptile Summary

This PR adds WebTailBench as a new external agent benchmark (608 tasks covering flights and shopping) and enables a navigate tool in the Anthropic CUA client that maps to goto with a post-navigation screenshot response.

  • WebTailBench benchmark: New JSONL dataset, task file (agent/webtailbench), suite fan-out, and CLI/config wiring — all following the established patterns from webvoyager and onlineMind2Web.
  • Anthropic CUA navigate tool: Adds a custom tool schema, converts it to a goto action, and returns a screenshot + URL in the tool result. This enables the CUA model to navigate directly to URLs without needing to type them into the browser address bar.
  • Evaluator prompt: The webtailbench evaluator treats purchasing/booking steps as out of scope, marking tasks as passed if all pre-purchase steps were completed.
  • The dataset contains trailing \r characters in the ques field values which flow into agent and evaluator prompts — a minor data cleanliness issue worth addressing.
  • The navigate tool's URL reporting in tool results may reflect a stale currentUrl since the field is only initialized once, though this is consistent with the existing computer tool behavior.
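As a rough illustration of the navigate tool described above, a custom tool definition and its conversion to a `goto` action might look like the following. The `name`, `description`, and `input_schema` layout follows Anthropic's custom tool format. The description strings, the `ToolUse` type, and the `toAction` helper are assumptions for this sketch and are not copied from `AnthropicCUAClient.ts`.

```typescript
// Custom tool definition in Anthropic's tool format (name, description, input_schema).
// The description text is illustrative, not the PR's wording.
const navigateTool = {
  name: "navigate",
  description: "Navigate the browser directly to the given URL.",
  input_schema: {
    type: "object",
    properties: {
      url: { type: "string", description: "Absolute URL to open." },
    },
    required: ["url"],
  },
} as const;

// Hypothetical shape of a tool_use block as seen by the client.
interface ToolUse {
  name: string;
  input: Record<string, unknown>;
}

type GotoAction = { type: "goto"; url: string };

// Convert a navigate tool call into a goto action; return null for anything else
// or for a missing or non-string URL.
function toAction(item: ToolUse): GotoAction | null {
  if (item.name !== navigateTool.name) return null;
  const url = item.input["url"];
  if (typeof url !== "string" || url.length === 0) return null;
  return { type: "goto", url };
}
```

Validating the `url` input before building the action keeps a malformed tool call from reaching the action handler.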

Confidence Score: 4/5

  • This PR is safe to merge; it adds a new benchmark and a navigate tool following established patterns with no breaking changes to existing functionality.
  • The eval-side changes closely follow existing benchmark patterns (webvoyager, onlineMind2Web). The CUA navigate tool is a focused addition. Minor concerns: stale URL in navigate tool results and trailing \r characters in dataset, neither of which block merging.
  • packages/core/lib/v3/agent/AnthropicCUAClient.ts (navigate tool URL accuracy), packages/evals/datasets/webtailbench/WebTailBench_data.jsonl (data quality — trailing \r in ques field)

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/core/lib/v3/agent/AnthropicCUAClient.ts | Adds a navigate custom tool (schema + action handling + tool result) that maps to goto. The screenshot and URL in tool results may use a stale currentUrl since it's only set once during initialization, though this is consistent with existing computer tool behavior. |
| packages/evals/tasks/agent/webtailbench.ts | New eval task for WebTailBench that follows established benchmark patterns (webvoyager, onlineMind2Web). Includes proper cleanup in finally block, event-based screenshot collection, and image resizing on buffers per project conventions. |
| packages/evals/suites/webtailbench.ts | New suite file that fans out JSONL dataset rows into per-model testcases. Follows the same patterns as other benchmark suites (gaia, webvoyager, onlineMind2Web) with env-driven limits and sampling. |
| packages/evals/index.eval.ts | Adds WebTailBench fan-out handling that mirrors the existing GAIA, WebVoyager, and Mind2Web patterns. Consistent dataset filter integration. |
| packages/evals/datasets/webtailbench/WebTailBench_data.jsonl | 608-line JSONL dataset with flight and shopping tasks. Contains trailing \r characters in ques field values that get passed directly to LLM prompts. |

Sequence Diagram

sequenceDiagram
    participant LLM as Anthropic CUA Model
    participant Client as AnthropicCUAClient
    participant Handler as v3CuaAgentHandler
    participant Page as Browser Page

    LLM->>Client: tool_use: navigate(url)
    Client->>Client: convertToolUseToAction → {type: "goto", url}
    Client->>Handler: actionHandler({type: "goto", url})
    Handler->>Page: page.goto(url, {waitUntil: "load"})
    Page-->>Handler: Navigation complete
    Handler->>Page: page.screenshot()
    Page-->>Handler: Screenshot buffer
    Handler-->>Client: Action complete
    Client->>Client: takeAction → captureScreenshot()
    Client->>Handler: screenshotProvider()
    Handler->>Page: page.screenshot()
    Page-->>Handler: Screenshot buffer (base64)
    Handler-->>Client: base64 image
    Client-->>LLM: tool_result: [image + "Navigated to: URL"]

Last reviewed commit: 0a87f74

Contributor

@greptile-apps greptile-apps bot left a comment


8 files reviewed, 2 comments


@@ -0,0 +1,609 @@
{"id":"united_13","category":"flights","ques":"What is the price difference between economy and business class on United Airlines direct flights from Chicago to São Paulo from 11/24/2025 to 12/14/2025? If there are no available flights for those dates, please indicate that in your answer\r","web":""}

Trailing \r in ques field values

Nearly all entries in this JSONL file have a literal \r (carriage return) embedded at the end of the ques string. For example, the first line's ques value ends with ...your answer\r". When parsed, this \r will be included in the string passed to both the agent instruction (agent.execute({ instruction: params.ques })) and the evaluator prompt. While this is unlikely to cause a hard failure, it introduces an invisible trailing character in every LLM prompt which could subtly affect tokenization and response behavior. Consider stripping \r from the dataset or trimming the ques values at parse time.
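One way to apply the parse-time option is to trim `ques` as each JSONL row is read. The sketch below assumes the four-field row shape visible in the quoted dataset line; `parseWebTailBenchRows` is a hypothetical helper, not a function from the PR.

```typescript
// Row shape inferred from the dataset line quoted above.
interface WebTailBenchRow {
  id: string;
  category: string;
  ques: string;
  web: string;
}

// Parse JSONL text into rows, skipping blank lines and trimming stray
// whitespace (including a trailing carriage return) from each question.
function parseWebTailBenchRows(jsonl: string): WebTailBenchRow[] {
  const rows: WebTailBenchRow[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmedLine = line.trim();
    if (trimmedLine.length === 0) continue;
    const row = JSON.parse(trimmedLine) as WebTailBenchRow;
    rows.push({ ...row, ques: row.ques.trim() });
  }
  return rows;
}
```

Because the `\r` is an escape sequence inside the JSON string, it only becomes a real carriage return after `JSON.parse`, so the trim has to run on the parsed `ques` value and not on the raw line.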

Comment on lines +636 to +638
} else if (item.name === "navigate") {
// For navigate tool, capture screenshot after navigation and return image
const screenshot = await this.captureScreenshot();

Navigate tool screenshot captured before navigation settles

The screenshot here is captured immediately after the action handler finishes page.goto() (which waits for "load"). However, there is no explicit wait for the page URL to be updated on this.currentUrl after the goto action. If setCurrentUrl is called asynchronously (e.g., via a page URL change listener), the this.currentUrl check on line 656 may still reflect the old URL rather than the navigated one.

For the computer tool (line 591), this is fine because the URL was already current. But for navigate, the value of this.currentUrl at screenshot time depends on the timing of URL updates. This may cause the tool result to say "Navigated to: [old URL]" instead of the actual destination.
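A possible mitigation, sketched under assumptions, is to resolve the URL when the tool result is built instead of relying on a cached field. The `UrlSource` interface and `navigateResultText` helper below are hypothetical. Whether the client can reach a live page object at that point is not confirmed by this page. The "Navigated to:" text follows the sequence diagram in the review summary.

```typescript
// Hypothetical minimal view of a page that can report its current URL.
interface UrlSource {
  url(): string;
}

// Prefer the live page URL, which reflects redirects; otherwise fall back to the
// URL the model asked for, so the result never reports the pre-navigation URL.
function navigateResultText(
  live: UrlSource | undefined,
  requestedUrl: string,
): string {
  const liveUrl = live ? live.url() : "";
  const resolved = liveUrl.length > 0 ? liveUrl : requestedUrl;
  return `Navigated to: ${resolved}`;
}
```

Falling back to the requested URL is a deliberate choice in this sketch: it can miss a redirect, but it is never older than the navigation itself.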
