
feat: add Steel cloud browser as alternative browser provider #100

Open
nibzard wants to merge 1 commit into TIGER-AI-Lab:main from nibzard:feat/steel-browser-provider

nibzard commented Apr 28, 2026

Summary

Adds an opt-in --browser=steel flag (CLI) and STEEL_API_KEY env switch (entrypoint) that routes the in-container CDP traffic to a Steel cloud session instead of local Chromium. The existing Docker + Chromium default is left completely untouched — Steel is an alternative provider, not a replacement.

Steel mode is "implicitly better" because it produces richer per-run artifacts (rrweb events, post-run cookie/storage forensics, a one-click replay viewer, and the fingerprint metadata exposed by Steel's session API), not because it reaches feature parity with Docker mode.

Architecture (one integration point)

A small aiohttp shim, steel-cdp-shim.py, listens on 127.0.0.1:9222 inside the existing per-case container and acts as a byte-level WebSocket passthrough to wss://connect.steel.dev. Every existing CDP client — extension-server's eval interceptor and every harness's Playwright/MCP/browser-use connection — keeps using http://127.0.0.1:9222 unchanged. Zero changes to the eight setup-*.sh and run-*.sh scripts, harness CLIs, or batch.py concurrency.
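A minimal sketch of what "byte-level passthrough" means in practice: each direction forwards frames as opaque bytes and never parses the CDP JSON. This is not the code in steel-cdp-shim.py. It uses plain asyncio iterables and callables in place of aiohttp WebSocket objects, and the names `pump`, `passthrough`, and `demo` are hypothetical.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable, List, Tuple

Frame = bytes
Send = Callable[[Frame], Awaitable[None]]


async def pump(source: AsyncIterator[Frame], send: Send) -> int:
    """Forward every frame from `source` to `send` without inspecting it."""
    forwarded = 0
    async for frame in source:
        await send(frame)
        forwarded += 1
    return forwarded


async def passthrough(
    client_in: AsyncIterator[Frame],
    upstream_send: Send,
    upstream_in: AsyncIterator[Frame],
    client_send: Send,
) -> Tuple[int, int]:
    """Relay both directions concurrently until both sources are exhausted.

    A real shim would also tear down the other direction as soon as one
    side closes; that is omitted here to keep the sketch short.
    """
    up, down = await asyncio.gather(
        pump(client_in, upstream_send),
        pump(upstream_in, client_send),
    )
    return up, down


async def _frames(items: Iterable[Frame]) -> AsyncIterator[Frame]:
    # Stand-in for iterating over a WebSocket connection.
    for item in items:
        yield item


def demo() -> Tuple[Tuple[int, int], List[Frame], List[Frame]]:
    to_upstream: List[Frame] = []
    to_client: List[Frame] = []

    async def send_up(frame: Frame) -> None:
        to_upstream.append(frame)

    async def send_down(frame: Frame) -> None:
        to_client.append(frame)

    counts = asyncio.run(
        passthrough(
            _frames([b'{"id":1}', b'{"id":2}']),
            send_up,
            _frames([b'{"id":1,"result":{}}']),
            send_down,
        )
    )
    return counts, to_upstream, to_client
```

Because the relay never decodes a frame, request ids and session ids chosen by each CDP client reach the upstream browser unchanged, which is why existing clients can keep pointing at 127.0.0.1:9222.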

When STEEL_API_KEY is set:

  • entrypoint.sh skips Xvfb / Chromium / socat / x11vnc / noVNC / ffmpeg
  • extension-server still runs (it's the eval interceptor); ffmpeg is gated on CLAWBENCH_STEEL_MODE since there's no Xvfb to capture
  • After the harness exits, steel-collect-artifacts.py queries Steel's session API endpoints and writes the results to /data/steel/
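The branch described above can be sketched as a pure function over the environment. The real logic lives in entrypoint.sh (a shell script); this Python version only illustrates the gating. The function names and the ordering of the lists are assumptions, while the service names are the ones listed in this PR.

```python
from __future__ import annotations

from typing import Mapping

# Processes named in the PR description. The order is illustrative only.
LOCAL_SERVICES = [
    "Xvfb", "Chromium", "socat", "x11vnc", "noVNC", "extension-server", "ffmpeg",
]
STEEL_SERVICES = ["steel-cdp-shim", "extension-server"]


def services_to_start(env: Mapping[str, str]) -> list[str]:
    """Return the services a container would launch for this environment."""
    if env.get("STEEL_API_KEY"):
        # Steel mode: no local display stack, no local browser, no recording.
        return list(STEEL_SERVICES)
    return list(LOCAL_SERVICES)


def post_run_steps(env: Mapping[str, str]) -> list[str]:
    """Artifact collection only applies when a Steel session was used."""
    return ["steel-collect-artifacts.py"] if env.get("STEEL_API_KEY") else []
```

For example, `services_to_start({"STEEL_API_KEY": "sk-example"})` yields only the shim and extension-server, while an empty environment yields the unchanged local stack.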

Load-bearing assumption — validated

Two simultaneous CDP clients on one Steel session (the eval-runner doing Target.setAutoAttach + Fetch.enable + a separate connection driving Page.navigate / in-page POST) have to coexist without one clobbering the other. This was the architectural risk; it's resolved.

tools/probe_steel_multiclient.py is included as a one-shot validator. Real-Steel run output:

[A-eval] enabling Target.setAutoAttach + Fetch.enable
[A-eval] auto-attached page session 1B9973E12F63
[B-harness] creating new page target
[A-eval] auto-attached page session DD480545D398
[B-harness] target 5B850F13C8DC created
[probe] A is attached to 2 page session(s)
[B-harness] Page.navigate → httpbin.org/get
[A-eval] Fetch.requestPaused GET https://httpbin.org/get
[B-harness] firing in-page POST → httpbin.org/post
[A-eval] Fetch.requestPaused POST https://httpbin.org/post
[probe] PASS — interception observed: {'url': 'https://httpbin.org/post', 'method': 'POST'}

Steel proxies Chrome's native multi-client CDP semantics, so no message-level mux in the shim is needed. Re-run the probe with STEEL_API_KEY=… uv run --extra dev python tools/probe_steel_multiclient.py to revalidate.
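For readers unfamiliar with the messages involved, here is a rough sketch of the CDP requests the two clients would send. It only builds the JSON payloads and is not taken from tools/probe_steel_multiclient.py. The class name `CdpMessageFactory`, the session id strings, and the catch-all `urlPattern` are assumptions; the method names and parameter keys are standard Chrome DevTools Protocol.

```python
import itertools
import json
from typing import Any, Dict, Optional


class CdpMessageFactory:
    """Builds CDP requests for one connection; each connection has its own id space."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._ids = itertools.count(1)

    def build(
        self,
        method: str,
        params: Optional[Dict[str, Any]] = None,
        session_id: Optional[str] = None,
    ) -> Dict[str, Any]:
        message: Dict[str, Any] = {
            "id": next(self._ids),
            "method": method,
            "params": params or {},
        }
        if session_id is not None:
            # Flattened sessions carry the target's sessionId at the top level.
            message["sessionId"] = session_id
        return message


# Client A (eval interceptor): auto-attach to page targets, then pause requests.
eval_client = CdpMessageFactory("A-eval")
auto_attach = eval_client.build(
    "Target.setAutoAttach",
    {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True},
)
fetch_enable = eval_client.build(
    "Fetch.enable",
    {"patterns": [{"urlPattern": "*"}]},
    session_id="SESSION-FROM-ATTACH-EVENT",  # hypothetical placeholder
)

# Client B (harness): drives navigation on its own connection.
harness_client = CdpMessageFactory("B-harness")
navigate = harness_client.build(
    "Page.navigate",
    {"url": "https://httpbin.org/get"},
    session_id="HARNESS-PAGE-SESSION",  # hypothetical placeholder
)

wire_frame = json.dumps(navigate)  # what would actually be sent over the WebSocket
```

Both factories start at id 1, which is harmless because request ids are scoped to a connection. That per-connection scoping is consistent with the observation that the shim needs no message-level mux.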

What you get with --browser=steel that you don't with local mode

  • data/steel/session.json — Steel session record: userAgent, dimensions, deviceConfig, region, stealthConfig, proxySource, creditsUsed, duration, eventCount, status
  • data/steel/events.jsonl — Steel's rrweb event stream (DOM mutations, input, network meta) — strictly richer than the recorder-extension actions.jsonl
  • data/steel/context.json — post-run cookies / localStorage / IndexedDB snapshot
  • data/steel/browser-version.json — Chrome / V8 version captured at session start
  • run-meta.json includes steel_session_viewer_url — one-click replay of any failed run
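As a rough illustration of how the event stream could be inspected after a run, the sketch below tallies events by their `type` field. It assumes each line of events.jsonl is one JSON object with a `type` key, as rrweb events have; the exact schema Steel writes is not shown in this PR, and `count_event_types` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count JSONL events by their `type` field, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Events without a `type` key are grouped under "unknown".
        counts[event.get("type", "unknown")] += 1
    return counts


# Usage against a finished run (path as listed above):
#     with open("data/steel/events.jsonl", encoding="utf-8") as fh:
#         print(count_event_types(fh).most_common())
```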

Files

New:

  • steel-cdp-shim.py (~250 LOC)
  • steel-collect-artifacts.py (~120 LOC)
  • tools/probe_steel_multiclient.py (multi-client validation script)
  • tests/test_steel_shim.py (4 unit tests, all passing)
  • tests/test_steel_provider.py (opt-in live integration test)

Modified (surgical):

  • entrypoint.sh — adds a STEEL_API_KEY branch at the top; existing flow untouched
  • extension-server/server.py — 5-line gate on ffmpeg in Steel mode
  • extension-server/pyproject.toml — adds aiohttp + steel-sdk to in-container deps
  • pyproject.toml — adds same to dev extras for host-side tests / probe
  • Dockerfile.base — COPY shim + collector into /app/
  • src/clawbench/run.py — --browser={local,steel} flag, threads STEEL_API_KEY into docker_run, surfaces steel_* keys in run-meta.json
  • src/clawbench/batch.py — --browser pass-through + best-effort Steel session-count print
  • src/clawbench/cli.py — click --browser option on run and batch so it shows in --help
  • README.md — short "Cloud browser provider — Steel" section

Total: 2047 insertions, 29 deletions across 15 files.

Out of scope (explicit)

  • Dockerless / host-direct mode — separate, larger change worth a follow-up PR.
  • claude-code-chrome-extension harness on Steel — fundamentally incompatible. The harness's bridge talks to the in-Chrome extension via Chrome native messaging, which can't span the cloud boundary. Rejected with a clear error message in run.py.
  • Action-recorder extension parity in Steel mode — Steel supports loading custom Chrome extensions (POST /v1/extensions + extensionIds), but the recorder POSTs to localhost:7878, which isn't reachable from cloud Chrome without a reverse tunnel. Steel's rrweb is a strict superset of what the recorder captures, so parity isn't urgent.
  • Replacing Docker-mode recording infrastructure — explicitly out; Docker mode is unchanged.
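The harness rejection mentioned above could look roughly like the following check. This is a sketch and not the code in run.py: the function name, the constant, and the error wording are assumptions, and only the harness name and the `--browser` values come from this PR.

```python
from typing import Optional

# Harnesses that depend on Chrome native messaging inside a local browser.
STEEL_INCOMPATIBLE_HARNESSES = {"claude-code-chrome-extension"}


def steel_rejection_reason(browser: str, harness: str) -> Optional[str]:
    """Return an error message if the combination is unsupported, else None."""
    if browser not in ("local", "steel"):
        return f"unknown browser provider: {browser!r}"
    if browser == "steel" and harness in STEEL_INCOMPATIBLE_HARNESSES:
        return (
            f"harness {harness!r} cannot run with --browser=steel: its bridge "
            "uses Chrome native messaging, which cannot reach a cloud browser"
        )
    return None
```

Returning a message instead of raising keeps the check easy to test; a caller would turn a non-None result into a CLI error before any container is started.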

Tier / cost notes

  • Each parallel batch job creates one Steel session. Steel concurrency caps apply per API key (5 on Hobby, 100 on Pro).
  • Hobby tier caps sessions at 15 min; ClawBench's default time_limit is 30 min. The shim passes api_timeout = TIME_LIMIT_S * 1000 + 60_000 to sessions.create(); on tier-cap rejection it writes a steel_session_create_failed stop reason and exits cleanly (no silent truncation).
  • --browser=steel --human is rejected (no noVNC in Steel mode).
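The timeout arithmetic above, worked through for the default limit. The formula is the one quoted in this PR; the helper names are hypothetical, and treating "timeout strictly greater than the cap" as the rejection condition is an assumption about Steel's behaviour.

```python
GRACE_MS = 60_000               # extra minute added on top of the run limit
DEFAULT_TIME_LIMIT_S = 30 * 60  # ClawBench default time_limit (30 min)
HOBBY_SESSION_CAP_S = 15 * 60   # Hobby tier session cap (15 min)


def api_timeout_ms(time_limit_s: int) -> int:
    """api_timeout = TIME_LIMIT_S * 1000 + 60_000, in milliseconds."""
    return time_limit_s * 1000 + GRACE_MS


def exceeds_session_cap(time_limit_s: int, cap_s: int) -> bool:
    """True if the requested timeout is longer than the tier allows (assumed rule)."""
    return api_timeout_ms(time_limit_s) > cap_s * 1000


default_timeout = api_timeout_ms(DEFAULT_TIME_LIMIT_S)  # 1_860_000 ms = 31 min
needs_higher_tier = exceeds_session_cap(DEFAULT_TIME_LIMIT_S, HOBBY_SESSION_CAP_S)
```

With the default 30-minute limit the requested timeout is 1,860,000 ms, more than double the 900,000 ms Hobby cap, so a default run on Hobby would hit the steel_session_create_failed path unless the time limit is lowered.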

Test plan

  • Unit tests for the shim's serialization adapters and /json/version response (4 passing).
  • Multi-client CDP probe against a real Steel session (PASS, output above).
  • Existing test suite still passes (39 passed, 2 skipped — 1 platform XDG, 1 opt-in live integration).
  • In-container end-to-end run with --browser=steel against one test case per harness (skipped here to avoid building 7+ images on the contributor side; happy to run pre-merge if you'd like specific harnesses validated).
  • Live integration test runnable via:
    STEEL_API_KEY=… CLAWBENCH_RUN_STEEL_INTEGRATION_TEST=1 \
      pytest tests/test_steel_provider.py
    

Acknowledgement

Authored by Steel.dev — happy to iterate on scope or split into smaller commits if that helps review.

🤖 Generated with Claude Code

Adds an opt-in `--browser=steel` flag (CLI) and `STEEL_API_KEY` env
switch (entrypoint) that routes the in-container CDP traffic to a
Steel cloud session instead of local Chromium. The existing Docker
+ Chromium default is left untouched.

Architecture:
- A small aiohttp shim (steel-cdp-shim.py) listens on 127.0.0.1:9222
  and acts as a byte-level WS passthrough to wss://connect.steel.dev.
  Every existing CDP client (extension-server's eval interceptor and
  every harness) keeps using http://127.0.0.1:9222 unchanged.
- entrypoint.sh detects STEEL_API_KEY and skips Xvfb / Chromium /
  socat / x11vnc / noVNC / ffmpeg in favour of the shim. extension-
  server still runs (it does the eval interception); ffmpeg is gated
  on CLAWBENCH_STEEL_MODE since there's no Xvfb to capture.
- Multi-client CDP behaviour (eval-runner Fetch.enable + harness
  Playwright on one Steel session) validated empirically via
  tools/probe_steel_multiclient.py against a real Steel session —
  no CDP message muxing required.

Steel mode produces richer artifacts than Docker mode:
- data/steel/session.json — full session record (userAgent, dimensions,
  deviceConfig, region, stealthConfig, proxySource, creditsUsed, …)
- data/steel/events.jsonl — Steel's rrweb event stream
- data/steel/context.json — post-run cookies / localStorage / IndexedDB
- data/steel/browser-version.json — Chrome/V8 captured at session start
- run-meta.json includes steel_session_viewer_url for one-click replay

Out of scope (explicit):
- Dockerless / host-direct mode (separate, larger change)
- claude-code-chrome-extension harness on Steel — its native-messaging
  bridge to the in-Chrome extension can't span the cloud boundary;
  rejected with a clear error from run.py.

Tests: 39 passed, 1 skipped (live Steel integration is opt-in via
CLAWBENCH_RUN_STEEL_INTEGRATION_TEST=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Perry2004 (Collaborator) commented

Hi Nikola, thanks for the contribution!

Having Steel integrated into ClawBench is attractive. I'll spend some time testing it soon. It looks like it will resolve many of the bot-detection hassles we encountered and provide better observability and scalability.

That said, I'll leave the PR open for now: we're currently refactoring and cleaning up the code base, and this integration will likely need some changes to match. I'll add this to the backlog and come back to it after the refactoring.

Perry2004 added the enhancement and good first issue labels Apr 29, 2026
Perry2004 linked an issue Apr 29, 2026 that may be closed by this pull request
nibzard (Author) commented Apr 29, 2026

Thanks, makes sense! If you need anything from my side just reach out.

github-project-automation (bot) moved this to Todo in ClawBench May 3, 2026
Perry2004 self-requested a review May 3, 2026 03:54
Perry2004 moved this from Todo to In Progress in ClawBench May 3, 2026
Perry2004 moved this from In Progress to Under Review in ClawBench May 3, 2026


Development

Successfully merging this pull request may close these issues.

FEAT: support steel browser

3 participants