feat: add Steel cloud browser as alternative browser provider by nibzard · Pull Request #100 · TIGER-AI-Lab/ClawBench

nibzard · 2026-04-28T08:02:00Z

Summary

Adds an opt-in --browser=steel flag (CLI) and STEEL_API_KEY env switch (entrypoint) that routes the in-container CDP traffic to a Steel cloud session instead of local Chromium. The existing Docker + Chromium default is left completely untouched — Steel is an alternative provider, not a replacement.

Steel mode's advantage is richer per-run artifacts (rrweb events, post-run cookie/storage forensics, a one-click replay viewer, and comprehensive fingerprint metadata exposed by Steel's session API); it does not aim for feature parity with Docker mode.

Architecture (one integration point)

A small aiohttp shim, steel-cdp-shim.py, listens on 127.0.0.1:9222 inside the existing per-case container and acts as a byte-level WebSocket passthrough to wss://connect.steel.dev. Every existing CDP client — extension-server's eval interceptor and every harness's Playwright/MCP/browser-use connection — keeps using http://127.0.0.1:9222 unchanged. Zero changes to the eight setup-*.sh and run-*.sh scripts, harness CLIs, or batch.py concurrency.
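The passthrough idea can be illustrated with a minimal, stdlib-only sketch. It is not the code in steel-cdp-shim.py: the real shim uses aiohttp WebSockets, whereas `FakeSocket`, `pump`, and `relay` below are hypothetical names with in-memory queues standing in for the two connections. The point it shows is that frames are copied in both directions without being parsed.

```python
import asyncio


class FakeSocket:
    """In-memory stand-in for one WebSocket endpoint (hypothetical, for illustration)."""

    def __init__(self):
        self.inbox = asyncio.Queue()  # frames this endpoint will "receive"
        self.sent = []                # frames written to this endpoint

    async def receive(self):
        # Returns the next frame, or None once the peer has closed.
        return await self.inbox.get()

    async def send(self, frame):
        self.sent.append(frame)


async def pump(src, dst):
    """Copy frames from src to dst unchanged until src closes."""
    while True:
        frame = await src.receive()
        if frame is None:
            break
        await dst.send(frame)  # no parsing: the bytes pass through as-is


async def relay(local, upstream):
    """Run both directions concurrently and stop when either side closes."""
    tasks = [
        asyncio.create_task(pump(local, upstream)),
        asyncio.create_task(pump(upstream, local)),
    ]
    _done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)


async def demo():
    local, upstream = FakeSocket(), FakeSocket()
    local.inbox.put_nowait(b'{"id":1,"method":"Page.enable"}')  # from a CDP client
    upstream.inbox.put_nowait(b'{"id":1,"result":{}}')          # from the cloud browser
    local.inbox.put_nowait(None)                                # the CDP client disconnects
    await relay(local, upstream)
    return local.sent, upstream.sent
```

Because the relay never inspects frame contents, it stays agnostic to which CDP domains a client uses, which is consistent with the claim that existing clients need no changes.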

When STEEL_API_KEY is set:

entrypoint.sh skips Xvfb / Chromium / socat / x11vnc / noVNC / ffmpeg
extension-server still runs (it's the eval interceptor); ffmpeg is gated on CLAWBENCH_STEEL_MODE since there's no Xvfb to capture
After the harness exits, steel-collect-artifacts.py queries Steel's session API endpoints and writes the results to /data/steel/

Load-bearing assumption — validated

Two simultaneous CDP clients on one Steel session (the eval-runner doing Target.setAutoAttach + Fetch.enable + a separate connection driving Page.navigate / in-page POST) have to coexist without one clobbering the other. This was the architectural risk; it's resolved.

tools/probe_steel_multiclient.py is included as a one-shot validator. Real-Steel run output:

[A-eval] enabling Target.setAutoAttach + Fetch.enable
[A-eval] auto-attached page session 1B9973E12F63
[B-harness] creating new page target
[A-eval] auto-attached page session DD480545D398
[B-harness] target 5B850F13C8DC created
[probe] A is attached to 2 page session(s)
[B-harness] Page.navigate → httpbin.org/get
[A-eval] Fetch.requestPaused GET https://httpbin.org/get
[B-harness] firing in-page POST → httpbin.org/post
[A-eval] Fetch.requestPaused POST https://httpbin.org/post
[probe] PASS — interception observed: {'url': 'https://httpbin.org/post', 'method': 'POST'}

Steel proxies Chrome's native multi-client CDP semantics, so no message-level mux in the shim is needed. Re-run the probe with STEEL_API_KEY=… uv run --extra dev python tools/probe_steel_multiclient.py to revalidate.
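For readers unfamiliar with the wire format, the sketch below builds the kind of CDP command frames the two clients in the probe would send. The method names and parameters are standard Chrome DevTools Protocol; `CdpClient`, the placeholder session ids, and the choice to send Fetch.enable on an attached page session are assumptions for illustration and may differ from what tools/probe_steel_multiclient.py actually does.

```python
import json


class CdpClient:
    """Builds CDP command frames for one WebSocket connection (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self._next_id = 0

    def command(self, method, params=None, session_id=None):
        # Command ids only need to be unique within a single connection.
        self._next_id += 1
        msg = {"id": self._next_id, "method": method, "params": params or {}}
        if session_id is not None:
            # With flattened auto-attach, page-scoped commands carry a sessionId.
            msg["sessionId"] = session_id
        return json.dumps(msg)


eval_client = CdpClient("A-eval")
harness = CdpClient("B-harness")

frames = [
    # Client A: attach to every new target, then pause its requests.
    eval_client.command(
        "Target.setAutoAttach",
        {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True},
    ),
    eval_client.command(
        "Fetch.enable",
        {"patterns": [{"urlPattern": "*"}]},
        session_id="PAGE_SESSION_A",  # placeholder, not a real session id
    ),
    # Client B: open a page and drive it on a separate connection.
    harness.command("Target.createTarget", {"url": "about:blank"}),
    harness.command(
        "Page.navigate",
        {"url": "https://httpbin.org/get"},
        session_id="PAGE_SESSION_B",  # placeholder, not a real session id
    ),
]
```

Each connection keeps its own id counter, so both clients can use id 1 without colliding; that independence is what makes a message-level mux unnecessary when the upstream honours multi-client semantics.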

What you get with `--browser=steel` that you don't with local mode

data/steel/session.json — Steel session record: userAgent, dimensions, deviceConfig, region, stealthConfig, proxySource, creditsUsed, duration, eventCount, status
data/steel/events.jsonl — Steel's rrweb event stream (DOM mutations, input, network meta) — strictly richer than the recorder-extension actions.jsonl
data/steel/context.json — post-run cookies / localStorage / IndexedDB snapshot
data/steel/browser-version.json — Chrome / V8 version captured at session start
run-meta.json includes steel_session_viewer_url — one-click replay of any failed run
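As a rough idea of how the event stream could be inspected after a run, here is a small sketch that counts events by rrweb type. It assumes each line of events.jsonl is one JSON object carrying rrweb's numeric `type` field; the exact shape Steel writes is not specified here, and `summarize_events` is a hypothetical helper, not part of this PR.

```python
import json
from collections import Counter

# Numeric values of rrweb's EventType enum.
RRWEB_EVENT_TYPES = {
    0: "DomContentLoaded",
    1: "Load",
    2: "FullSnapshot",
    3: "IncrementalSnapshot",
    4: "Meta",
    5: "Custom",
    6: "Plugin",
}


def summarize_events(lines):
    """Count events per rrweb type from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        name = RRWEB_EVENT_TYPES.get(event.get("type"), "Unknown")
        counts[name] += 1
    return dict(counts)


sample = [
    '{"type": 4, "timestamp": 1, "data": {"href": "https://httpbin.org/get"}}',
    '{"type": 2, "timestamp": 2, "data": {}}',
    '{"type": 3, "timestamp": 3, "data": {"source": 0}}',
    '{"type": 3, "timestamp": 4, "data": {"source": 2}}',
    '',
]
summary = summarize_events(sample)
```

Against a real run, the same function could be fed an open file handle, for example `summarize_events(open("data/steel/events.jsonl"))`.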

Files

New:

steel-cdp-shim.py (~250 LOC)
steel-collect-artifacts.py (~120 LOC)
tools/probe_steel_multiclient.py (multi-client validation script)
tests/test_steel_shim.py (4 unit tests, all passing)
tests/test_steel_provider.py (opt-in live integration test)

Modified (surgical):

entrypoint.sh — adds a STEEL_API_KEY branch at the top; existing flow untouched
extension-server/server.py — 5-line gate on ffmpeg in Steel mode
extension-server/pyproject.toml — adds aiohttp + steel-sdk to in-container deps
pyproject.toml — adds same to dev extras for host-side tests / probe
Dockerfile.base — COPY shim + collector into /app/
src/clawbench/run.py — --browser={local,steel} flag, threads STEEL_API_KEY into docker_run, surfaces steel_* keys in run-meta.json
src/clawbench/batch.py — --browser pass-through + best-effort Steel session-count print
src/clawbench/cli.py — click --browser option on run and batch so it shows in --help
README.md — short "Cloud browser provider — Steel" section

Total: 2047 insertions, 29 deletions across 15 files.

Out of scope (explicit)

Dockerless / host-direct mode — separate, larger change worth a follow-up PR.
claude-code-chrome-extension harness on Steel — fundamentally incompatible. The harness's bridge talks to the in-Chrome extension via Chrome native messaging, which can't span the cloud boundary. Rejected with a clear error message in run.py.
Action-recorder extension parity in Steel mode — Steel supports loading custom Chrome extensions (POST /v1/extensions + extensionIds), but the recorder POSTs to localhost:7878, which isn't reachable from cloud Chrome without a reverse tunnel. Steel's rrweb is a strict superset of what the recorder captures, so parity isn't urgent.
Replacing Docker-mode recording infrastructure — explicitly out; Docker mode is unchanged.
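The up-front rejections mentioned in this PR (the claude-code-chrome-extension harness on Steel, and `--browser=steel --human`) amount to a small option check. The sketch below shows one way such a check could look; `validate_browser_options`, `BrowserOptionError`, and the error wording are hypothetical and are not taken from run.py.

```python
class BrowserOptionError(ValueError):
    """Raised for option combinations that cannot work (hypothetical name)."""


def validate_browser_options(browser, harness, human=False):
    """Fail fast on unsupported combinations before any container starts."""
    if browser not in ("local", "steel"):
        raise BrowserOptionError(f"unknown browser provider: {browser!r}")
    if browser != "steel":
        return  # local mode is unchanged; nothing further to check
    if harness == "claude-code-chrome-extension":
        # Native messaging to the in-Chrome extension cannot reach a cloud browser.
        raise BrowserOptionError(
            "claude-code-chrome-extension is not supported with --browser=steel"
        )
    if human:
        # There is no noVNC in Steel mode.
        raise BrowserOptionError("--human is not supported with --browser=steel")
```

Checking before `docker run` keeps the failure cheap and the message close to the flag the user typed.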

Tier / cost notes

Each parallel batch job creates one Steel session. Steel concurrency caps apply per API key (5 on Hobby, 100 on Pro).
Hobby tier caps sessions at 15 min; ClawBench's default time_limit is 30 min. The shim passes api_timeout = TIME_LIMIT_S * 1000 + 60_000 to sessions.create(); on tier-cap rejection it writes a steel_session_create_failed stop reason and exits cleanly (no silent truncation).
--browser=steel --human is rejected (no noVNC in Steel mode).
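The timeout arithmetic above is easy to check by hand. The sketch below reproduces the formula and compares it with the 15-minute Hobby cap; the function and constant names are made up for illustration, and the local comparison is only a hypothetical pre-flight check (per the description, the actual rejection comes from Steel when the session is created). Whether the cap is inclusive is not stated, so the comparison here is strict by assumption.

```python
HOBBY_SESSION_CAP_MS = 15 * 60 * 1000  # 15 min cap on the Hobby tier, in ms


def steel_api_timeout_ms(time_limit_s):
    """api_timeout = TIME_LIMIT_S * 1000 + 60_000 (one minute of headroom)."""
    return time_limit_s * 1000 + 60_000


def exceeds_cap(time_limit_s, cap_ms=HOBBY_SESSION_CAP_MS):
    """Hypothetical pre-flight check: would the requested timeout exceed the cap?"""
    return steel_api_timeout_ms(time_limit_s) > cap_ms


default_timeout = steel_api_timeout_ms(30 * 60)  # ClawBench's default 30 min limit
```

With the default 30-minute limit the request is 1,860,000 ms, a little over twice the 900,000 ms Hobby cap, so default runs on that tier should be expected to hit the steel_session_create_failed path unless time_limit is lowered.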

Test plan

Unit tests for the shim's serialization adapters and /json/version response (4 passing).
Multi-client CDP probe against a real Steel session (PASS, output above).
Existing test suite still passes (39 passed, 2 skipped — 1 platform XDG, 1 opt-in live integration).
In-container end-to-end run with --browser=steel against one test case per harness (skipped here to avoid building 7+ images on the contributor side; happy to run pre-merge if you'd like specific harnesses validated).
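The /json/version response matters because CDP clients commonly discover the browser WebSocket by fetching that endpoint and reading `webSocketDebuggerUrl`. A plausible shape for what a local shim would return is sketched below; the field names follow Chrome's own /json/version output, but the function name, the default `browser` string, and the WebSocket path are assumptions, not the shim's actual values.

```python
def json_version_payload(host="127.0.0.1", port=9222, browser="Chrome/unknown"):
    """Build a /json/version body that points clients back at the local shim."""
    return {
        "Browser": browser,
        "Protocol-Version": "1.3",
        # Clients connect here, so it must be the local address rather than
        # the upstream cloud URL. The path below is a made-up placeholder.
        "webSocketDebuggerUrl": f"ws://{host}:{port}/devtools/browser/shim",
    }


payload = json_version_payload()
```

Returning a local URL is what keeps the upstream endpoint and API key out of every harness's configuration.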

Live integration test runnable via:

STEEL_API_KEY=… CLAWBENCH_RUN_STEEL_INTEGRATION_TEST=1 \
  pytest tests/test_steel_provider.py

Acknowledgement

Authored by Steel.dev — happy to iterate on scope or split into smaller commits if that helps review.

🤖 Generated with Claude Code

Adds an opt-in `--browser=steel` flag (CLI) and `STEEL_API_KEY` env switch (entrypoint) that routes the in-container CDP traffic to a Steel cloud session instead of local Chromium. The existing Docker + Chromium default is left untouched.

Architecture:

- A small aiohttp shim (steel-cdp-shim.py) listens on 127.0.0.1:9222 and acts as a byte-level WS passthrough to wss://connect.steel.dev. Every existing CDP client (extension-server's eval interceptor and every harness) keeps using http://127.0.0.1:9222 unchanged.
- entrypoint.sh detects STEEL_API_KEY and skips Xvfb / Chromium / socat / x11vnc / noVNC / ffmpeg in favour of the shim. extension-server still runs (it does the eval interception); ffmpeg is gated on CLAWBENCH_STEEL_MODE since there's no Xvfb to capture.
- Multi-client CDP behaviour (eval-runner Fetch.enable + harness Playwright on one Steel session) validated empirically via tools/probe_steel_multiclient.py against a real Steel session; no CDP message muxing required.

Steel mode produces richer artifacts than Docker mode:

- data/steel/session.json: full session record (userAgent, dimensions, deviceConfig, region, stealthConfig, proxySource, creditsUsed, …)
- data/steel/events.jsonl: Steel's rrweb event stream
- data/steel/context.json: post-run cookies / localStorage / IndexedDB
- data/steel/browser-version.json: Chrome/V8 captured at session start
- run-meta.json includes steel_session_viewer_url for one-click replay

Out of scope (explicit):

- Dockerless / host-direct mode (separate, larger change)
- claude-code-chrome-extension harness on Steel: its native-messaging bridge to the in-Chrome extension can't span the cloud boundary; rejected with a clear error from run.py.

Tests: 39 passed, 1 skipped (live Steel integration is opt-in via CLAWBENCH_RUN_STEEL_INTEGRATION_TEST=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Perry2004 · 2026-04-29T01:24:12Z

Hi Nikola, thanks for the contribution!

It would be great to have Steel integrated into ClawBench. I'll spend some time testing it soon. It looks like it will resolve a lot of the bot-detection hassles we encountered and provide better observability and scalability.

That said, I'll leave the PR open for now: we're currently refactoring and cleaning up the code base, and this integration will likely need some changes to match. I'll add this to the backlog and come back to it after the refactoring.

nibzard · 2026-04-29T07:47:31Z

Thanks, makes sense! If you need anything from my side just reach out.

Perry2004 added the enhancement and good first issue labels Apr 29, 2026

Perry2004 linked an issue Apr 29, 2026 that may be closed by this pull request

FEAT: support steel browser #105

Open

reacher-z added this to ClawBench May 3, 2026

github-project-automation Bot moved this to Todo in ClawBench May 3, 2026

Perry2004 self-requested a review May 3, 2026 03:54

Perry2004 moved this from Todo to In Progress in ClawBench May 3, 2026

Perry2004 moved this from In Progress to Under Review in ClawBench May 3, 2026

Perry2004 mentioned this pull request May 13, 2026

fix: resolve #105 — FEAT: support steel browser #152

Closed

3 tasks

nibzard wants to merge 1 commit into TIGER-AI-Lab:main from nibzard:feat/steel-browser-provider
