
Add Stagehand v4 eval harness#2086

Open
pirate wants to merge 12 commits into main from evals-v4-sdk-tools


pirate commented May 6, 2026

Summary

  • split bench harness implementations into side-by-side Stagehand v3, Stagehand v4, Claude, and Codex harness files
  • add an Understudy v4 tool bridge that derives tool names/schemas from the v4 protocol SDK instead of redefining the tool surface in evals
  • make the v4 harness lazy-load the local v4 SDK only when stagehand_v4 is selected, with a clean missing-SDK path error
  • print v4 bus.logTree() at cleanup when eval verbose mode is enabled
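
For reviewers, the tool-bridge idea in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ProtocolToolSpec`, `AgentTool`, `Dispatch`, and `deriveTools` are hypothetical names, and the real v4 protocol SDK may expose its catalog in a different shape. The point is only that the agent-facing tool list is mapped from whatever the SDK publishes, so the eval package never restates names or schemas.

```typescript
// Hypothetical shape of one entry in an SDK-provided tool catalog.
interface ProtocolToolSpec {
  name: string;
  description: string;
  paramsSchema: Record<string, unknown>; // JSON-Schema-like object
}

// Hypothetical shape consumed by the agent loop.
interface AgentTool {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  execute: (args: Record<string, unknown>) => Promise<unknown>;
}

// Hypothetical transport: forwards a call to the SDK by tool name.
type Dispatch = (
  tool: string,
  args: Record<string, unknown>,
) => Promise<unknown>;

// Map every catalog entry to an agent tool. Names and schemas are copied
// from the catalog, never hard-coded on the eval side.
function deriveTools(
  catalog: readonly ProtocolToolSpec[],
  dispatch: Dispatch,
): AgentTool[] {
  return catalog.map((spec) => ({
    name: spec.name,
    description: spec.description,
    inputSchema: spec.paramsSchema,
    execute: (args) => dispatch(spec.name, args),
  }));
}
```

With this shape, adding a tool to the SDK catalog would surface it in the harness without an eval-side change, which is the drift the bridge is meant to avoid.
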
[Image: Screenshot 2026-05-06 at 3 08 37 AM]

https://www.braintrust.dev/app/Browserbase/p/stagehand-dev/experiments/observe%2Fobserve_simple_google_search-2a092abe?c=observe/observe_simple_google_search-cd76e3b1&r=4f104beb-fee2-42c0-9f9c-17fd2bc240f3&s=5d806aa7-7e45-4cc2-ad91-0965e396efdb

Validation

  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --filter @browserbasehq/stagehand-evals run test:unit
  • EVAL_HEADLESS=false pnpm -w exec tsx packages/evals/cli.ts run observe/observe_simple_google_search --harness stagehand_v4 -t 1 -c 1 -m anthropic/claude-haiku-4-5 -p anthropic
  • STAGEHAND_V4_SDK_PATH=/tmp/missing-stagehand-v4-sdk.ts pnpm -w exec tsx packages/evals/cli.ts run observe/observe_simple_google_search --harness stagehand_v4 -t 1 -c 1 -m anthropic/claude-haiku-4-5 -p anthropic
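
The last command above exercises the missing-SDK path. A rough sketch of how a lazy loader with a readable error could look is below. It assumes Node.js built-ins only; `DEFAULT_SDK_PATH`, `resolveV4SdkPath`, and `loadV4Sdk` are hypothetical names, the default path is a placeholder, and the actual implementation and error text in this PR may differ.

```typescript
import { existsSync } from "node:fs";
import { resolve } from "node:path";
import { pathToFileURL } from "node:url";

// Placeholder default; the real default path is not stated in this PR text.
const DEFAULT_SDK_PATH = "./stagehand-v4-sdk/index.ts";

// Resolve the SDK entry point and fail early with a readable message
// instead of a raw module-resolution stack trace.
function resolveV4SdkPath(
  env: Record<string, string | undefined> = process.env,
): string {
  const sdkPath = resolve(env.STAGEHAND_V4_SDK_PATH ?? DEFAULT_SDK_PATH);
  if (!existsSync(sdkPath)) {
    throw new Error(
      `Stagehand v4 SDK not found at ${sdkPath}. ` +
        `Set STAGEHAND_V4_SDK_PATH to a local v4 SDK entry point.`,
    );
  }
  return sdkPath;
}

let cachedSdk: Promise<unknown> | undefined;

// Import the SDK on first use only, so other harnesses never touch it.
async function loadV4Sdk(): Promise<unknown> {
  cachedSdk ??= import(pathToFileURL(resolveV4SdkPath()).href);
  return cachedSdk;
}
```

Because the import happens inside `loadV4Sdk`, a run with another harness would never evaluate the SDK module, and a bad path surfaces as a single error message.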

Summary by cubic

Adds a stagehand_v4 eval harness that runs the v3 agent loop against the v4 SDK via a native tool bridge, with a v4-backed page facade and assertions. Also renames understudy_code to understudy_v3_code, sets stagehand_v3 as the default harness, and enables harness‑native bench implementations.

  • New Features

    • Added StagehandAgentV4Harness with UnderstudyV4Tools:
      • derives tool catalog from the v4 SDK
      • lazy-loads the SDK
      • blocks --api
      • exposes ctx.v4
      • prints v4 bus.logTree() on verbose cleanup
      • installs a v4-backed page facade (goto/evaluate/waitForLoadState/locator) with fixed load-state and eval-target tracking
      • adapts flattened action params
      • returns page text when extract has no schema
      • uses v4 element info for locator assertions
    • Split and registered harnesses: added StagehandAgentV3Harness, ClaudeAgentHarness, and CodexAgentHarness; runner now selects harness-native implementations via defineBenchTask(...benchFns).
    • Planner/CLI updates: default harness is stagehand_v3; default core tool is understudy_v3_code; help/tests updated; CLI enforces that only stagehand_v3 supports --api.
  • Migration

    • Use --harness stagehand_v3 (default) to keep current behavior, or --harness stagehand_v4 (no --api) to use the v4 SDK.
    • Replace understudy_code with understudy_v3_code. CLI defaults, help, and tests reflect the new name.
    • stagehand_v4 requires a local v4 SDK; set STAGEHAND_V4_SDK_PATH or rely on the default path. The CLI validates presence before running.
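
The harness and --api rules in the migration notes could be checked along these lines. This is only a sketch under stated assumptions: `RunOptions` and `validateRunOptions` are hypothetical names, the error wording is invented for illustration, and the real CLI option parsing is not shown in this PR text.

```typescript
// Hypothetical parsed CLI options; only the two fields discussed above.
interface RunOptions {
  harness?: string;
  api?: boolean;
}

// Apply the default harness and reject --api for any other harness.
function validateRunOptions(opts: RunOptions): { harness: string; api: boolean } {
  const harness = opts.harness ?? "stagehand_v3"; // default per the notes above
  const api = opts.api ?? false;
  if (api && harness !== "stagehand_v3") {
    throw new Error(
      `--api is only supported with --harness stagehand_v3 (got ${harness})`,
    );
  }
  return { harness, api };
}
```

Under this sketch, `validateRunOptions({})` keeps current behavior, while `validateRunOptions({ harness: "stagehand_v4", api: true })` fails before any run starts.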

Written for commit 8b9cf5b. Summary will update on new commits.

