feat: add Steel cloud browser as alternative browser provider #100
Open
nibzard wants to merge 1 commit into
Adds an opt-in `--browser=steel` flag (CLI) and `STEEL_API_KEY` env switch (entrypoint) that routes the in-container CDP traffic to a Steel cloud session instead of local Chromium. The existing Docker + Chromium default is left untouched.

Architecture:
- A small aiohttp shim (steel-cdp-shim.py) listens on 127.0.0.1:9222 and acts as a byte-level WS passthrough to wss://connect.steel.dev. Every existing CDP client (extension-server's eval interceptor and every harness) keeps using http://127.0.0.1:9222 unchanged.
- entrypoint.sh detects STEEL_API_KEY and skips Xvfb / Chromium / socat / x11vnc / noVNC / ffmpeg in favour of the shim. extension-server still runs (it does the eval interception); ffmpeg is gated on CLAWBENCH_STEEL_MODE since there's no Xvfb to capture.
- Multi-client CDP behaviour (eval-runner Fetch.enable + harness Playwright on one Steel session) validated empirically via tools/probe_steel_multiclient.py against a real Steel session; no CDP message muxing required.

Steel mode produces richer artifacts than Docker mode:
- data/steel/session.json: full session record (userAgent, dimensions, deviceConfig, region, stealthConfig, proxySource, creditsUsed, …)
- data/steel/events.jsonl: Steel's rrweb event stream
- data/steel/context.json: post-run cookies / localStorage / IndexedDB
- data/steel/browser-version.json: Chrome/V8 captured at session start
- run-meta.json includes steel_session_viewer_url for one-click replay

Out of scope (explicit):
- Dockerless / host-direct mode (separate, larger change)
- claude-code-chrome-extension harness on Steel: its native-messaging bridge to the in-Chrome extension can't span the cloud boundary; rejected with a clear error from run.py.

Tests: 39 passed, 1 skipped (live Steel integration is opt-in via CLAWBENCH_RUN_STEEL_INTEGRATION_TEST=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
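As a rough illustration of how the `data/steel/events.jsonl` artifact mentioned above might be inspected after a run, the sketch below counts events by their `type` field. It assumes the file holds one JSON object per line and that each object carries the numeric `type` field rrweb events normally have; the PR does not show the file's exact shape, and `summarize_events` is a hypothetical helper, not part of this change.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_events(path: Path) -> Counter:
    """Count events per `type` value in a JSON Lines file.

    Assumption: one JSON object per line, each with a `type` key
    (as rrweb events have). Blank lines are skipped.
    """
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Events without a `type` key are grouped under None.
            counts[event.get("type")] += 1
    return counts
```

Calling `summarize_events(Path("data/steel/events.jsonl"))` would return a `Counter` keyed by event type, which gives a quick sanity check that a session actually recorded something.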
Collaborator
Hi Nikola, thanks for the contribution! It is attractive to have Steel integrated into ClawBench. I'll spend some time testing it in a bit. It looks like it will resolve a lot of the bot-detection hassles we encountered and provide better observability and scalability. That said, I'll leave the PR open for now: we're currently refactoring and cleaning up the code base, which will likely require some accompanying changes to this integration. I'll add this to the backlog and come back to it after the refactoring.
Author
Thanks, makes sense! If you need anything from my side, just reach out.
Summary
Adds an opt-in `--browser=steel` flag (CLI) and `STEEL_API_KEY` env switch (entrypoint) that routes the in-container CDP traffic to a Steel cloud session instead of local Chromium. The existing Docker + Chromium default is left completely untouched: Steel is an alternative provider, not a replacement.

Steel mode's advantage over Docker mode is richer per-run artifacts (rrweb events, post-run cookie/storage forensics, one-click replay viewer, comprehensive fingerprint metadata exposed by Steel's session API), not feature-completeness parity.
Architecture (one integration point)
A small aiohttp shim, `steel-cdp-shim.py`, listens on `127.0.0.1:9222` inside the existing per-case container and acts as a byte-level WebSocket passthrough to `wss://connect.steel.dev`. Every existing CDP client (`extension-server`'s eval interceptor and every harness's Playwright/MCP/browser-use connection) keeps using `http://127.0.0.1:9222` unchanged. Zero changes to the eight `setup-*.sh` and `run-*.sh` scripts, harness CLIs, or `batch.py` concurrency.

When `STEEL_API_KEY` is set:
- `entrypoint.sh` skips Xvfb / Chromium / socat / x11vnc / noVNC / ffmpeg
- `extension-server` still runs (it's the eval interceptor); ffmpeg is gated on `CLAWBENCH_STEEL_MODE` since there's no Xvfb to capture
- `steel-collect-artifacts.py` pulls Steel's session API endpoints and dumps results to `/data/steel/`

Load-bearing assumption (validated)
Two simultaneous CDP clients on one Steel session (the eval-runner doing `Target.setAutoAttach` + `Fetch.enable`, plus a separate connection driving `Page.navigate` / in-page POST) have to coexist without one clobbering the other. This was the architectural risk; it's resolved. `tools/probe_steel_multiclient.py` is included as a one-shot validator. In a run against a real Steel session, the probe showed that Steel proxies Chrome's native multi-client CDP semantics, so no message-level mux in the shim is needed. Re-run the probe with `STEEL_API_KEY=… uv run --extra dev python tools/probe_steel_multiclient.py` to revalidate.

What you get with `--browser=steel` that you don't with local mode
- `data/steel/session.json`: Steel session record (`userAgent`, `dimensions`, `deviceConfig`, `region`, `stealthConfig`, `proxySource`, `creditsUsed`, `duration`, `eventCount`, `status`)
- `data/steel/events.jsonl`: Steel's rrweb event stream (DOM mutations, input, network meta); strictly richer than the recorder-extension `actions.jsonl`
- `data/steel/context.json`: post-run cookies / localStorage / IndexedDB snapshot
- `data/steel/browser-version.json`: Chrome / V8 version captured at session start
- `run-meta.json` includes `steel_session_viewer_url`: one-click replay of any failed run

Files
New:
- `steel-cdp-shim.py` (~250 LOC)
- `steel-collect-artifacts.py` (~120 LOC)
- `tools/probe_steel_multiclient.py` (multi-client validation script)
- `tests/test_steel_shim.py` (4 unit tests, all passing)
- `tests/test_steel_provider.py` (opt-in live integration test)

Modified (surgical):
- `entrypoint.sh`: adds a `STEEL_API_KEY` branch at the top; existing flow untouched
- `extension-server/server.py`: 5-line gate on ffmpeg in Steel mode
- `extension-server/pyproject.toml`: adds `aiohttp` + `steel-sdk` to in-container deps
- `pyproject.toml`: adds the same to dev extras for host-side tests / probe
- `Dockerfile.base`: `COPY` shim + collector into `/app/`
- `src/clawbench/run.py`: `--browser={local,steel}` flag, threads `STEEL_API_KEY` into `docker_run`, surfaces `steel_*` keys in `run-meta.json`
- `src/clawbench/batch.py`: `--browser` pass-through + best-effort Steel session-count print
- `src/clawbench/cli.py`: click `--browser` option on `run` and `batch` so it shows in `--help`
- `README.md`: short "Cloud browser provider — Steel" section

Total: 2047 insertions, 29 deletions across 15 files.
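A CDP client pointed at `http://127.0.0.1:9222` typically fetches `/json/version` and then connects to the `webSocketDebuggerUrl` it finds there, so a local shim plausibly has to advertise a local WebSocket address rather than the upstream one. The sketch below shows that rewrite in isolation using only the standard library. It is an assumption about what such a shim needs, not the code in `steel-cdp-shim.py`; `local_json_version`, the `/devtools/browser` fallback path, and the sample upstream URL are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

LOCAL_HOST = "127.0.0.1"  # address the shim listens on, per the PR description
LOCAL_PORT = 9222


def local_json_version(upstream: dict, host: str = LOCAL_HOST, port: int = LOCAL_PORT) -> dict:
    """Return a copy of a /json/version payload that points at the local shim.

    Assumption: the payload follows Chrome's shape, i.e. a JSON object
    with a `webSocketDebuggerUrl` key. The upstream query string is
    dropped so that credentials in it never reach local clients.
    """
    payload = dict(upstream)  # shallow copy; the input is left untouched
    parts = urlsplit(payload.get("webSocketDebuggerUrl", ""))
    path = parts.path or "/devtools/browser"  # hypothetical fallback path
    payload["webSocketDebuggerUrl"] = urlunsplit(("ws", f"{host}:{port}", path, "", ""))
    return payload
```

Because only the advertised URL changes, every other field (browser version, protocol version) passes through as the upstream reported it, which matches the "byte-level passthrough" intent described above.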
Out of scope (explicit)
- `claude-code-chrome-extension` harness on Steel: fundamentally incompatible. The harness's bridge talks to the in-Chrome extension via Chrome native messaging, which can't span the cloud boundary. Rejected with a clear error message in `run.py`.
- … (`POST /v1/extensions` + `extensionIds`), but the recorder POSTs to `localhost:7878`, which isn't reachable from cloud Chrome without a reverse tunnel. Steel's rrweb is a strict superset of what the recorder captures, so parity isn't urgent.

Tier / cost notes
- … `batch` job creates one Steel session. Steel concurrency caps apply per API key (5 on Hobby, 100 on Pro).
- `time_limit` is 30 min. The shim passes `api_timeout = TIME_LIMIT_S * 1000 + 60_000` to `sessions.create()`; on tier-cap rejection it writes a `steel_session_create_failed` stop reason and exits cleanly (no silent truncation).
- `--browser=steel --human` is rejected (no noVNC in Steel mode).

Test plan
- … `/json/version` response (4 passing).
- `--browser=steel` against one test case per harness (skipped here to avoid building 7+ images on the contributor side; happy to run pre-merge if you'd like specific harnesses validated).

Acknowledgement
Authored by Steel.dev; happy to iterate on scope or split into smaller commits if that helps review.
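For reviewers who want the rejection rules and the timeout arithmetic from the out-of-scope and tier / cost notes in one place, here is a minimal sketch. It is not the code in `run.py`: `SteelConfigError`, `check_steel_options`, and `steel_api_timeout_ms` are hypothetical names, and treating a missing `STEEL_API_KEY` as an error is an assumption. Only the two rejected combinations and the `TIME_LIMIT_S * 1000 + 60_000` formula come from this description.

```python
from typing import Mapping


class SteelConfigError(ValueError):
    """Hypothetical error type for option combinations Steel mode rejects."""


# Harnesses the description lists as incompatible with Steel mode.
UNSUPPORTED_HARNESSES = {"claude-code-chrome-extension"}


def steel_api_timeout_ms(time_limit_s: int) -> int:
    """Session timeout in milliseconds: the time limit plus 60 s of headroom."""
    return time_limit_s * 1000 + 60_000


def check_steel_options(browser: str, harness: str, human: bool, env: Mapping[str, str]) -> None:
    """Raise SteelConfigError for combinations that cannot work in Steel mode."""
    if browser != "steel":
        return  # local mode is unaffected by these checks
    if not env.get("STEEL_API_KEY"):
        # Assumption: a missing key is treated as a hard error.
        raise SteelConfigError("--browser=steel requires STEEL_API_KEY")
    if harness in UNSUPPORTED_HARNESSES:
        raise SteelConfigError(f"harness {harness!r} cannot run against a cloud browser")
    if human:
        raise SteelConfigError("--human is unavailable in Steel mode (no noVNC)")
```

With the 30-minute limit mentioned above, `steel_api_timeout_ms(30 * 60)` gives 1,860,000 ms, i.e. 31 minutes.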
🤖 Generated with Claude Code