More tool improvements and addition of eval CLI #90
Merged
Commits (11)
- 6e742e3 wip: eval runner and tool improvements (tysonthomas9)
- c25be68 Additional updates to universal agent sdk (tysonthomas9)
- d3ae58b fix issues (tysonthomas9)
- d52f8a0 Fix build issue and add focus selector (tysonthomas9)
- 3ece01d Improve hybrid dom (tysonthomas9)
- 0d6020e more updates (tysonthomas9)
- 438618d Revert to v0 version (tysonthomas9)
- b32030f minor fixes (tysonthomas9)
- 3673e93 fix build issue (tysonthomas9)
- 7bd65aa fix pr comments (tysonthomas9)
- 84ba1af more refactors (tysonthomas9)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| # API Keys for Eval Runner | ||
| # Copy this file to .env and fill in your keys | ||
|
|
||
| # Agent LLM providers | ||
| CEREBRAS_API_KEY=your-cerebras-api-key | ||
| OPENAI_API_KEY=your-openai-api-key | ||
| ANTHROPIC_API_KEY=your-anthropic-api-key | ||
|
|
||
| # Optional: Braintrust for experiment tracking | ||
| BRAINTRUST_API_KEY=your-braintrust-api-key |
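
The template above ships with `your-...` placeholder values, so a quick pre-flight check can catch keys that were never filled in. The following is a minimal sketch of such a check, not the eval runner's actual `.env` loading (which this diff does not show); `parseEnv` and `findUnsetKeys` are hypothetical helper names, and `sk-example` is a made-up value.

```typescript
// Sketch only: hypothetical helpers, not code from this PR.

// Parse KEY=value lines, skipping blank lines and "#" comments.
function parseEnv(text: string): Map<string, string> {
  const vars = new Map<string, string>();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq <= 0) continue; // not in KEY=value shape
    vars.set(line.slice(0, eq).trim(), line.slice(eq + 1).trim());
  }
  return vars;
}

// Return the keys that are missing, empty, or still hold a "your-..." placeholder.
function findUnsetKeys(text: string, keys: string[]): string[] {
  const vars = parseEnv(text);
  return keys.filter((key) => {
    const value = vars.get(key);
    return value === undefined || value === '' || value.startsWith('your-');
  });
}

const exampleEnv = [
  '# Agent LLM providers',
  'CEREBRAS_API_KEY=your-cerebras-api-key',
  'OPENAI_API_KEY=sk-example', // made-up value for illustration
].join('\n');

console.log(
  findUnsetKeys(exampleEnv, ['CEREBRAS_API_KEY', 'OPENAI_API_KEY', 'ANTHROPIC_API_KEY']),
);
```

Treating a leftover placeholder the same as a missing key is a design choice of this sketch; it keeps a copied but unedited `.env` from looking configured.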
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,206 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | ||
|
|
||
| ## Project Overview | ||
|
|
||
| **Browser Operator** is an AI-native browser built on Chrome DevTools frontend. It adds a multi-agent AI framework to the DevTools panel, enabling intelligent automation and web interaction through specialized AI agents. | ||
|
|
||
| ## Build & Development Commands | ||
|
|
||
| ### Initial Setup | ||
|
|
||
| ```bash | ||
| # Prerequisites: depot_tools in PATH (https://chromium.googlesource.com/chromium/tools/depot_tools.git) | ||
| gclient sync | ||
| npm install | ||
| cp .env.example .env # Configure API keys | ||
| ``` | ||
|
|
||
| ### Build | ||
|
|
||
| ```bash | ||
| npm run build # Standard build (runs gn gen automatically) | ||
| npm run build -- --watch # Watch mode for development | ||
| npm run build -- -t Debug # Build to out/Debug instead of out/Default | ||
|
|
||
| # Fast build (skip type checking and bundling) | ||
| gn gen out/fast-build --args="devtools_skip_typecheck=true devtools_bundle=false" | ||
| npm run build -- -t fast-build | ||
| ``` | ||
|
|
||
| ### Running DevTools with Custom Build | ||
|
|
||
| ```bash | ||
| # Terminal 1: Build with watch | ||
| npm run build -- --watch | ||
|
|
||
| # Terminal 2: Serve the built files | ||
| cd out/Default/gen/front_end && python3 -m http.server 9000 | ||
|
|
||
| # Terminal 3: Launch Browser Operator with custom DevTools | ||
| /Applications/Browser\ Operator.app/Contents/MacOS/Browser\ Operator \ | ||
| --disable-infobars \ | ||
| --custom-devtools-frontend=http://localhost:9000/ \ | ||
| --remote-debugging-port=9222 | ||
| ``` | ||
|
|
||
| ### Testing | ||
|
|
||
| ```bash | ||
| npm run test # Unit tests (Karma/Mocha) | ||
| npm run webtest # E2E tests (Puppeteer) | ||
| npm run debug-webtest -- --spec=path/to/test # Debug specific test | ||
| npm run lint # ESLint | ||
| ``` | ||
|
|
||
| ### Eval Runner (Agent Testing) | ||
|
|
||
| **Recommended: Use the eval-runner-analyst agent** to run evals and get detailed analysis: | ||
|
|
||
| ``` | ||
| # In Claude Code, use the Task tool with eval-runner-analyst agent: | ||
| "Run the action agent evals with cerebras gpt-oss-120b" | ||
| "Test action-agent-checkbox-001 and action-agent-form-001" | ||
| "Compare V0 and V1 action agents on iframe tests" | ||
| ``` | ||
|
|
||
| The eval-runner-analyst agent handles the complete workflow: running tests, collecting results, and providing detailed analysis of pass/fail patterns. | ||
|
|
||
| **Manual CLI usage** (if needed): | ||
|
|
||
| The eval runner automatically loads environment variables from `.env` in the project root. | ||
|
|
||
| ```bash | ||
| # Run agent evaluations (launches headless Chrome by default) | ||
| npx tsx scripts/eval-runner/cli.ts --tool action_agent --verbose | ||
| npx tsx scripts/eval-runner/cli.ts --test action-agent-click-001 --verbose | ||
|
|
||
| # Use Cerebras for fast inference (preferred models: zai-glm-4.6, gpt-oss-120b) | ||
| npx tsx scripts/eval-runner/cli.ts --provider cerebras --model zai-glm-4.6 --tool action_agent | ||
| npx tsx scripts/eval-runner/cli.ts --provider cerebras --model gpt-oss-120b --tool action_agent | ||
|
|
||
| # Run V0 agent variant | ||
| npx tsx scripts/eval-runner/cli.ts --tool action_agent --tool-override action_agent_v0 --provider cerebras --model gpt-oss-120b | ||
|
|
||
| # Connect to running Browser Operator (bypasses bot detection, uses authenticated sessions) | ||
| npx tsx scripts/eval-runner/cli.ts --tool action_agent --remote-debugging-port 9222 --verbose | ||
|
|
||
| # Run with visible browser | ||
| npx tsx scripts/eval-runner/cli.ts --tool action_agent --no-headless | ||
| ``` | ||
|
|
||
| **Note:** The LLM judge defaults to OpenAI (`gpt-4o`) regardless of agent provider. Override with `--judge-provider` and `--judge-model`. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### DevTools Module Hierarchy | ||
|
|
||
| ``` | ||
| front_end/ | ||
| ├── core/ # Shared utilities, CDP backend integration | ||
| ├── models/ # Business logic, data handling | ||
| ├── panels/ # High-level panels (one per DevTools tab) | ||
| ├── ui/components/ # Reusable UI components | ||
| └── entrypoints/ # Application entrypoints (devtools_app.ts) | ||
| ``` | ||
|
|
||
| Visibility rules: `core/` → `models/` → `panels/` → `entrypoints/` (enforced by GN build) | ||
|
|
||
| ### AI Chat Panel (`front_end/panels/ai_chat/`) | ||
|
|
||
| ``` | ||
| ai_chat/ | ||
| ├── agent_framework/ # Agent execution engine | ||
| │ ├── AgentRunner.ts # LLM loop, tool execution, handoffs | ||
| │ ├── ConfigurableAgentTool.ts # Agent definition via config objects | ||
| │ └── implementation/ # Concrete agent configs (ActionAgent, etc.) | ||
| ├── LLM/ # Provider integrations | ||
| │ ├── LLMClient.ts # Client facade | ||
| │ ├── LLMProviderRegistry.ts # Provider management | ||
| │ └── *Provider.ts # OpenAI, Cerebras, Anthropic, Groq, etc. | ||
| ├── cdp/ # Chrome DevTools Protocol adapters | ||
| │ ├── CDPSessionAdapter.ts # Abstract CDP interface | ||
| │ ├── DirectCDPAdapter.ts # Direct CDP connection (eval runner) | ||
| │ └── SDKTargetAdapter.ts # DevTools SDK integration | ||
| ├── tools/ # Agent tools (~30 tools for browser actions) | ||
| ├── dom/ # Element resolution (shadow DOM, iframes) | ||
| ├── common/ # Shared utilities (geometry, mouse, xpath) | ||
| ├── core/ # Orchestration, LLMConfigurationManager | ||
| ├── evaluation/ # Test case definitions | ||
| └── ui/ # Chat panel UI components | ||
| ``` | ||
|
|
||
| ### Key Concepts | ||
|
|
||
| **Agent Framework** | ||
| - `ConfigurableAgentTool`: Agents defined via config (name, prompt, tools, schema, handoffs) | ||
| - `AgentRunner`: Executes agent loop - LLM calls, tool execution, agent handoffs | ||
| - `ToolRegistry`: Central registry for tools/agents (`ToolRegistry.registerToolFactory()`) | ||
| - Handoffs: Agents transfer to specialists via LLM tool calls or max iterations | ||
|
|
||
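
The two handoff paths listed above (an explicit LLM tool call, or running out of iterations) can be illustrated with a simplified loop. This is a sketch under stated assumptions, not the `AgentRunner` implementation: the `Step` and `Outcome` types, `runLoopSketch`, and the `'max_iterations'` reason string are all hypothetical, and the LLM is replaced by a plain callback so the example runs on its own.

```typescript
// Hypothetical types: the real AgentRunner is async and far richer.
type Step =
  | { kind: 'tool_call'; tool: string }
  | { kind: 'handoff'; target: string }
  | { kind: 'final'; answer: string };

type Outcome =
  | { kind: 'final'; answer: string }
  | { kind: 'handoff'; target: string; reason: 'llm_tool_call' | 'max_iterations' };

function runLoopSketch(
  nextStep: (iteration: number) => Step, // stands in for one LLM call
  maxIterations: number,
  fallbackAgent: string,
): Outcome {
  for (let i = 0; i < maxIterations; i++) {
    const step = nextStep(i);
    if (step.kind === 'final') {
      return { kind: 'final', answer: step.answer };
    }
    if (step.kind === 'handoff') {
      // The model explicitly asked to transfer to a specialist.
      return { kind: 'handoff', target: step.target, reason: 'llm_tool_call' };
    }
    // step.kind === 'tool_call': a real runner would execute the tool here
    // and feed its result into the next LLM call.
  }
  // Iteration budget exhausted without a final answer: hand off to the fallback.
  return { kind: 'handoff', target: fallbackAgent, reason: 'max_iterations' };
}
```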
| **CDP Adapters** - Abstraction layer for Chrome DevTools Protocol: | ||
| - `SDKTargetAdapter`: Used when running inside DevTools (has SDK access) | ||
| - `DirectCDPAdapter`: Used by eval runner (connects via chrome-remote-interface) | ||
| - Both implement `CDPSessionAdapter` interface with `getAgent(domain)` method | ||
|
|
||
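
One practical consequence of the shared `CDPSessionAdapter` interface is that tool code can be written once and exercised against a fake adapter. The sketch below assumes a much smaller shape than the real interface likely has: `CDPAgentSketch`, its `invoke` method, and `RecordingAdapter` are hypothetical names, and only `getAgent(domain)` comes from the description above.

```typescript
// Hypothetical minimal shapes; the real CDPSessionAdapter may differ.
interface CDPAgentSketch {
  invoke(method: string, params?: Record<string, unknown>): Promise<unknown>;
}

interface CDPSessionAdapterSketch {
  getAgent(domain: string): CDPAgentSketch;
}

// Fake adapter that records calls instead of talking to a browser.
class RecordingAdapter implements CDPSessionAdapterSketch {
  readonly calls: string[] = [];

  getAgent(domain: string): CDPAgentSketch {
    return {
      invoke: async (method, params) => {
        this.calls.push(`${domain}.${method}`);
        return { params: params ?? {} };
      },
    };
  }
}

// Code written against the interface works with any adapter implementation.
async function navigateSketch(adapter: CDPSessionAdapterSketch, url: string): Promise<void> {
  await adapter.getAgent('Page').invoke('navigate', { url });
}
```

In this sketch, swapping `RecordingAdapter` for an SDK-backed or direct-connection adapter would not require changes to `navigateSketch`.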
| **LLM Configuration** (via `LLMConfigurationManager`): | ||
| - 3-tier models: Main (powerful), Mini (fast), Nano (simple tasks) | ||
| - Override system: Per-request overrides for eval without affecting localStorage | ||
| - Providers: openai, cerebras, anthropic, groq, openrouter, litellm | ||
|
|
||
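
The override idea described above (per-request settings that leave the stored configuration alone) can be sketched as a pure merge. This assumes nothing about the real `LLMConfigurationManager` API: `LLMConfigSketch`, `LLMOverrideSketch`, and `resolveConfig` are hypothetical, and the `placeholder-*` model names are invented for the example.

```typescript
// Hypothetical shapes, not the LLMConfigurationManager API.
type ModelTier = 'main' | 'mini' | 'nano';

interface LLMConfigSketch {
  provider: string;
  models: Record<ModelTier, string>;
}

interface LLMOverrideSketch {
  provider?: string;
  models?: Partial<Record<ModelTier, string>>;
}

// Returns a new config; the stored config object is never mutated.
function resolveConfig(stored: LLMConfigSketch, override?: LLMOverrideSketch): LLMConfigSketch {
  return {
    provider: override?.provider ?? stored.provider,
    models: { ...stored.models, ...(override?.models ?? {}) },
  };
}

const storedConfig: LLMConfigSketch = {
  provider: 'openai',
  models: { main: 'gpt-4o', mini: 'placeholder-mini', nano: 'placeholder-nano' },
};

// An eval run swaps the provider and main model for this request only.
const evalConfig = resolveConfig(storedConfig, {
  provider: 'cerebras',
  models: { main: 'gpt-oss-120b' },
});

console.log(evalConfig.provider, evalConfig.models.main, storedConfig.provider);
```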
| ### Adding a New Agent | ||
|
|
||
| ```typescript | ||
| // In implementation/ConfiguredAgents.ts | ||
| function createMyAgentConfig(): AgentToolConfig { | ||
| return { | ||
| name: 'my_agent', | ||
| description: 'What this agent does', | ||
| systemPrompt: 'Instructions for agent behavior', | ||
| tools: ['navigate_url', 'perform_action'], // Registered tool names | ||
| schema: { /* JSON schema for input */ }, | ||
| handoffs: [{ targetAgentName: 'specialist_agent', trigger: 'llm_tool_call' }], | ||
| maxIterations: 10, | ||
| }; | ||
| } | ||
|
|
||
| // Register in initializeConfiguredAgents() | ||
| const myAgent = new ConfigurableAgentTool(createMyAgentConfig()); | ||
| ToolRegistry.registerToolFactory('my_agent', () => myAgent); | ||
| ``` | ||
|
|
||
| ### Adding a New Tool | ||
|
|
||
| Tools implement the `Tool` interface with `name`, `description`, `schema`, and `execute()`. Register via `ToolRegistry.registerToolFactory()`. | ||
|
|
||
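
As an illustration of the four members named above, here is a minimal tool plus a stand-in registry so the example runs outside the repository. The exact `Tool` signature and the real `ToolRegistry` are not shown in this diff, so `ToolSketch`, `SketchToolRegistry`, `getTool`, and `WordCountTool` are assumptions made for the example.

```typescript
// Hypothetical shape; the real Tool interface may type args and results differently.
interface ToolSketch {
  name: string;
  description: string;
  schema: Record<string, unknown>; // JSON schema for the tool's input
  execute(args: Record<string, unknown>): Promise<unknown>;
}

// Stand-in for ToolRegistry so the sketch is self-contained.
class SketchToolRegistry {
  private static factories = new Map<string, () => ToolSketch>();

  static registerToolFactory(name: string, factory: () => ToolSketch): void {
    SketchToolRegistry.factories.set(name, factory);
  }

  static getTool(name: string): ToolSketch | undefined {
    const factory = SketchToolRegistry.factories.get(name);
    return factory ? factory() : undefined;
  }
}

class WordCountTool implements ToolSketch {
  name = 'word_count';
  description = 'Counts whitespace-separated words in a string';
  schema = {
    type: 'object',
    properties: { text: { type: 'string' } },
    required: ['text'],
  };

  async execute(args: Record<string, unknown>): Promise<unknown> {
    const text = typeof args.text === 'string' ? args.text.trim() : '';
    return { words: text === '' ? 0 : text.split(/\s+/).length };
  }
}

SketchToolRegistry.registerToolFactory('word_count', () => new WordCountTool());
```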
| ### Eval Runner Architecture | ||
|
|
||
| ``` | ||
| scripts/eval-runner/ | ||
| ├── cli.ts # CLI entry point | ||
| ├── TestRunner.ts # Test orchestration | ||
| ├── BrowserExecutor.ts # Puppeteer/CDP automation | ||
| ├── AgentBridge.ts # Connects runner to agent tools | ||
| ├── LLMJudge.ts # LLM-based evaluation scoring | ||
| └── reporters/ # Console, JSON, Markdown output | ||
| ``` | ||
|
|
||
| Test cases defined in `front_end/panels/ai_chat/evaluation/test-cases/`. | ||
|
|
||
| ## Environment Variables | ||
|
|
||
| ```bash | ||
| OPENAI_API_KEY=... # OpenAI | ||
| CEREBRAS_API_KEY=... # Cerebras (fast inference) | ||
| ANTHROPIC_API_KEY=... # Anthropic | ||
| BRAINTRUST_API_KEY=... # Experiment tracking (optional) | ||
| ``` | ||
|
|
||
| ## Key Patterns | ||
|
|
||
| - **Lazy loading**: Features dynamically imported via `*-meta.ts` files | ||
| - **GN build system**: Visibility rules enforce module boundaries; edit BUILD.gn when adding files | ||
| - **EventBus**: Uses `Common.ObjectWrapper.ObjectWrapper` for DevTools-compatible events | ||
| - **Shadow DOM/iframe support**: `EnhancedElementResolver` and `buildBackendIdMaps()` handle composed trees | ||
| - **Node ID mapping**: Accessibility tree `nodeId` differs from DOM `backendDOMNodeId`; use mapping utilities |
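
The node ID point in the last bullet can be made concrete with a small lookup table. In the Chrome DevTools Protocol, an accessibility node carries its own string `nodeId` and an optional `backendDOMNodeId`. The helper below, `buildAxToBackendMap`, is a hypothetical illustration and is not the repository's `buildBackendIdMaps()` or any of its mapping utilities.

```typescript
// Subset of the CDP Accessibility.AXNode fields relevant to the mapping.
interface AXNodeLike {
  nodeId: string;            // accessibility-tree id
  backendDOMNodeId?: number; // absent when the node has no DOM counterpart
}

// Build a lookup from accessibility nodeId to DOM backend node id.
function buildAxToBackendMap(nodes: AXNodeLike[]): Map<string, number> {
  const map = new Map<string, number>();
  for (const node of nodes) {
    if (node.backendDOMNodeId !== undefined) {
      map.set(node.nodeId, node.backendDOMNodeId);
    }
  }
  return map;
}

const sampleNodes: AXNodeLike[] = [
  { nodeId: '1', backendDOMNodeId: 42 },
  { nodeId: '2' }, // no DOM node behind this accessibility node
  { nodeId: '3', backendDOMNodeId: 57 },
];

const axToBackend = buildAxToBackendMap(sampleNodes);
console.log(axToBackend.get('3'), axToBackend.has('2'));
```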