No issues found across 8 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant CLI as CLI / Runner
participant Suite as WebTailBench Suite
participant Task as WebTailBench Task
participant Agent as Stagehand V3 Agent
participant Bus as V3 Event Bus
participant Eval as V3 Evaluator
Note over CLI, Eval: NEW: WebTailBench Evaluation Flow
CLI->>Suite: buildWebTailBenchTestcases(models)
Suite->>Suite: NEW: Read WebTailBench_data.jsonl
Suite->>Suite: NEW: Fan out test cases (limit 25)
Suite-->>CLI: Array of Testcases
loop For each Testcase
CLI->>Task: execute(v3, input, modelName)
Task->>Task: CHANGED: Navigate to Google (startUrl)
Task->>Agent: NEW: init Agent(cua: true)
Note right of Agent: NEW: Includes 'navigate' tool<br/>mapping to goto + screenshot
Task->>Bus: NEW: Subscribe to agent_screenshot_taken_event
Task->>Agent: execute(instruction)
loop Agent Steps
Agent->>Agent: Reasoning & Tool Use
opt Tool: navigate
Agent->>Agent: NEW: Capture post-navigation screenshot
end
Agent-->>Bus: NEW: Emit screenshot buffer
Bus-->>Task: NEW: ScreenshotCollector captures buffer
end
Agent-->>Task: agentResult (message/reasoning)
Task->>Task: Stop collection & resize screenshots (0.7x)
Task->>Eval: ask(question, screenshots, reasoning)
Note right of Eval: Evaluation prompt ignores<br/>payment/booking steps
Eval-->>Task: evaluation (YES/NO) + reasoning
Task-->>CLI: success status & logs
end
Note over Task, Bus: Cleanup: Unsubscribe from Event Bus & clear buffers
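The subscribe, collect, and cleanup steps in the diagram could look roughly like the sketch below. It is only an illustration: `BusLike`, the payload type, and the `ScreenshotCollector` methods are assumptions made for this example (only the event name and the collector's name appear in the diagram), and the PR's actual implementation may differ.

```typescript
// Hypothetical minimal bus shape; an object shaped like an event emitter would fit.
interface BusLike {
  on(event: string, listener: (buf: Uint8Array) => void): void;
  off(event: string, listener: (buf: Uint8Array) => void): void;
}

// Event name taken from the diagram; the payload shape is an assumption.
const SCREENSHOT_EVENT = "agent_screenshot_taken_event";

class ScreenshotCollector {
  private buffers: Uint8Array[] = [];

  // Keep one stable listener reference so it can be removed later.
  private readonly listener = (buf: Uint8Array): void => {
    this.buffers.push(buf);
  };

  constructor(private readonly bus: BusLike) {}

  // Subscribe before the agent starts executing.
  start(): void {
    this.bus.on(SCREENSHOT_EVENT, this.listener);
  }

  // Unsubscribe, hand back what was collected, and clear internal state.
  stop(): Uint8Array[] {
    this.bus.off(SCREENSHOT_EVENT, this.listener);
    const collected = this.buffers;
    this.buffers = [];
    return collected;
  }
}
```

Calling `stop()` in a `finally` block around the agent run would ensure the listener is removed even when the task throws.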
Greptile Summary
This PR adds WebTailBench as a new external agent benchmark (608 tasks covering flights and shopping) and enables a
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant LLM as Anthropic CUA Model
participant Client as AnthropicCUAClient
participant Handler as v3CuaAgentHandler
participant Page as Browser Page
LLM->>Client: tool_use: navigate(url)
Client->>Client: convertToolUseToAction → {type: "goto", url}
Client->>Handler: actionHandler({type: "goto", url})
Handler->>Page: page.goto(url, {waitUntil: "load"})
Page-->>Handler: Navigation complete
Handler->>Page: page.screenshot()
Page-->>Handler: Screenshot buffer
Handler-->>Client: Action complete
Client->>Client: takeAction → captureScreenshot()
Client->>Handler: screenshotProvider()
Handler->>Page: page.screenshot()
Page-->>Handler: Screenshot buffer (base64)
Handler-->>Client: base64 image
Client-->>LLM: tool_result: [image + "Navigated to: URL"]
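The first conversion step in this diagram, turning a navigate tool call into a goto action, can be sketched as follows. The `ToolUse` and `GotoAction` shapes and the function name `convertNavigateToolUse` are hypothetical; the real `convertToolUseToAction` is not shown in this review and likely handles more cases.

```typescript
// Assumed shape of a tool_use item; not the client's actual type.
interface ToolUse {
  name: string;
  input: Record<string, unknown>;
}

// Assumed action shape, matching {type: "goto", url} from the diagram.
type GotoAction = { type: "goto"; url: string };

// Returns a goto action for a well-formed navigate call, otherwise null.
function convertNavigateToolUse(item: ToolUse): GotoAction | null {
  if (item.name !== "navigate") return null;
  const url = item.input["url"];
  if (typeof url !== "string" || url.length === 0) return null;
  return { type: "goto", url };
}
```

Returning null for a missing or non-string url keeps malformed tool calls from reaching the action handler.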
Last reviewed commit: 0a87f74
@@ -0,0 +1,609 @@
{"id":"united_13","category":"flights","ques":"What is the price difference between economy and business class on United Airlines direct flights from Chicago to São Paulo from 11/24/2025 to 12/14/2025? If there are no available flights for those dates, please indicate that in your answer\r","web":""}
Trailing \r in ques field values
Nearly all entries in this JSONL file have a carriage return (\r) embedded at the end of the ques string. For example, the first line's ques value ends with ...your answer\r". When parsed, this \r is included in the string passed to both the agent instruction (agent.execute({ instruction: params.ques })) and the evaluator prompt. This is unlikely to cause a hard failure, but it puts an invisible trailing character in every LLM prompt, which could subtly affect tokenization and response behavior. Consider stripping \r from the dataset or trimming the ques values at parse time.
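Trimming at parse time could be done along these lines. This is a minimal sketch, not the PR's loader: the row fields (id, category, ques, web) are taken from the dataset line above, while `WebTailBenchRow` and `parseJsonlRows` are names invented for the example.

```typescript
// Field names mirror the dataset line shown in the diff; the type name is hypothetical.
interface WebTailBenchRow {
  id: string;
  category: string;
  ques: string;
  web: string;
}

// Parse JSONL text and strip leading/trailing whitespace (including \r) from ques.
function parseJsonlRows(text: string): WebTailBenchRow[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .map((line) => {
      const row = JSON.parse(line) as WebTailBenchRow;
      return { ...row, ques: row.ques.trim() };
    });
}
```

Trimming in the loader leaves the dataset file untouched, which may be preferable if the file is synced from an upstream source.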
} else if (item.name === "navigate") {
// For navigate tool, capture screenshot after navigation and return image
const screenshot = await this.captureScreenshot();
Navigate tool screenshot captured before navigation settles
The screenshot here is captured immediately after the action handler finishes page.goto() (which waits for "load"). However, nothing explicitly waits for this.currentUrl to be updated after the goto action. If setCurrentUrl is called asynchronously (e.g., via a page URL change listener), the this.currentUrl check on line 656 may still see the old URL rather than the navigated one.
For the computer tool (line 591), this is fine because the URL is already current. For navigate, however, the value of this.currentUrl at screenshot time depends on when the URL update lands, so the tool result may say "Navigated to: [old URL]" instead of the actual destination.
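One way to avoid depending on the cached value is to read the URL from the page itself once goto resolves. The sketch below assumes a page object with Playwright-style `goto` and `url()` methods; `PageLike` and `navigateAndDescribe` are hypothetical names for illustration, not part of the client under review.

```typescript
// Hypothetical minimal page shape, modelled on Playwright's goto() and url().
interface PageLike {
  goto(url: string, options?: { waitUntil?: "load" }): Promise<unknown>;
  url(): string;
}

// Navigate, then build the tool-result text from the page's own URL
// instead of a separately cached currentUrl field.
async function navigateAndDescribe(page: PageLike, target: string): Promise<string> {
  await page.goto(target, { waitUntil: "load" });
  return `Navigated to: ${page.url()}`;
}
```

Because the URL is read after the awaited navigation, the message reflects the final destination, including any redirect the page followed.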
why
what changed
test plan
Summary by cubic
Adds WebTailBench as an external agent benchmark with dataset, task, suite fan-out, and CLI support. Also enables a navigate tool in Anthropic CUA that maps to goto and returns a post-navigation screenshot, and updates the evaluator prompt to pass tasks completed up to (but not including) purchase/booking steps.
New Features
Migration
Written for commit 0a87f74. Summary will update on new commits.