Canonical operating roadmap for browser-agent-driver.
This is the single planning document for:
- mission
- success criteria
- benchmark policy
- promotion gates
- failure taxonomy
- execution order
- immediate priorities
Use RELIABILITY.md for the day-to-day run loop. Use competitor-analysis-2026-03.md for external reference points. Use README.md for package/API/CLI surface only.
Build a general-purpose browser agent that completes real tasks reliably, produces complete artifacts, and improves through controlled measurement instead of anecdotal prompt tuning.
Primary outcomes:
- higher pass rate
- lower median duration
- lower median turns
- lower token cost
- complete artifacts on every run
Non-goals:
- optimizing for one demo app at the expense of generality
- shipping features without measurable reliability impact
- widening benchmark scope before the current slice is stable
- mixing product work and research work in the same experiment
The system is healthy only when all are true:
- Tier 1 remains at 100%
- Tier 2 trends to 100% through bug closure, not benchmark filtering
- Tier 3 is used to measure generalization, not excuse regressions
- every serious run emits report, manifest, and recording
- promotion decisions are backed by repeated seeded runs
The program is succeeding when we can repeatedly do this loop:
- measure a clean baseline
- classify failures correctly
- fix one high-leverage failure class
- rerun the same slice
- promote only when the delta holds
This is an eval-driven control system. Treat it like one.
- Fix execution bugs before policy tuning.
- Fix verifier bugs before prompt tuning.
- Treat zero-turn and startup failures as infrastructure until proven otherwise.
- Change one variable at a time.
- Prefer deterministic fixes over prompt inflation.
- Keep product-specific hints optional.
- Keep wallet and crypto behavior isolated behind explicit flags.
- Do not call a result real unless the artifacts and config are preserved.
Build toward a small layered control system, not a generic framework:
- `actor`: main browser policy loop
- `scout`: cheap recommendation pass on ambiguous link/result pages only
- `verifier`: deterministic completion policy plus LLM verification where needed
- `supervisor`: hard-stall recovery only
Guardrails:
- no new browser action types for scouting
- no broad plugin system
- no default branching everywhere
- vision, branching, and extra compute stay challenger-only until they win repeatedly
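A minimal sketch of the intended layering, with hypothetical TypeScript interface and type names (this is not the current API):

```typescript
// Hypothetical shapes for the four layers; names and fields are illustrative.
interface PageState { url: string; snapshot: string; }
interface Action { kind: "navigate" | "click" | "type" | "complete"; target?: string; value?: string; }

interface Actor {
  // Main browser policy loop: observe the page, decide the next action.
  decide(state: PageState, history: Action[]): Promise<Action>;
}

interface Scout {
  // Cheap recommendation pass, invoked only on ambiguous link/result pages.
  recommend(state: PageState, candidateLinks: string[]): Promise<string | null>;
}

interface Verifier {
  // Deterministic completion policy first; LLM verification only where needed.
  verify(state: PageState, completionClaim: string): Promise<{ accepted: boolean; reason: string }>;
}

interface Supervisor {
  // Hard-stall recovery only; no routine planning responsibilities.
  recover(state: PageState, stalledTurns: number): Promise<Action | null>;
}
```

The point of the shape: `scout` and `supervisor` stay narrow add-ons around `actor`, never peer planners.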
Tier 1 (deterministic fixtures):
- local, controlled, repeatable
- must stay at 100%
- blocks merges

Tier 2 (authenticated core flows):
- staging or owned environments with credentials
- target is 100%
- used to validate real product flows

Tier 3 (public web):
- open-web capability and generalization
- expected to be noisy
- cannot justify regressions in Tier 1 or Tier 2
Promotion-grade experiments must keep these fixed:
- scenario slice
- seed
- model
- timeout budget
- browser mode
- memory policy
- artifact policy
Required metrics:
- pass rate
- median duration
- median turns
- token usage
- artifact completeness
- failure-class distribution
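One way to make "fixed" and "required" concrete is a frozen config plus a metrics record; the types and field names below are assumptions, not the current schema:

```typescript
// Everything here must be identical between control and challenger arms.
interface ExperimentConfig {
  readonly scenarioSlice: string[];          // e.g. ["webbench-2204", "webbench-2605", "webbench-32"]
  readonly seed: number;
  readonly model: string;                    // pinned control model
  readonly timeoutBudgetMs: number;
  readonly browserMode: "headed" | "headless";
  readonly memoryPolicy: "isolated-per-run" | "shared";
  readonly artifactPolicy: "report+manifest+recording";
}

// Every arm must report all of these, or the comparison is invalid.
interface ExperimentMetrics {
  passRate: number;                          // 0..1
  medianDurationMs: number;
  medianTurns: number;
  totalTokens: number;
  artifactsComplete: boolean;
  failureClassCounts: Record<string, number>;
}
```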
Parallelism policy:
- parallelize inside one experiment first
- do not run multiple unrelated promotion-grade experiments in parallel
- use outer-loop parallelism only for coarse screening, never for final decisions
Memory policy:
- isolate memory per run during benchmark experiments unless memory is the intervention being tested
- never compare contaminated and uncontaminated arms
Model policy:
- product/runtime defaults may advance as newer official models become available
- benchmark controls stay pinned until a slice is intentionally re-baselined
- newer models belong in challenger arms until they beat the fixed control cleanly
Promote only when all are true:
- no Tier 1 regression
- no Tier 2 regression
- target-slice pass rate is positive or neutral
- artifacts remain complete
- failure mix does not shift toward a worse structural class
- if pass rate is flat, duration or token cost improvement is meaningful
Reject or roll back when any are true:
- Tier 1 drops
- Tier 2 drops without an explicit temporary exception
- artifact completeness degrades
- the apparent gain depends on one-off wins or unseeded runs
- the change mixes multiple interventions and cannot be attributed cleanly
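The gate can then be a pure function over control and challenger metrics, reusing the hypothetical `ExperimentMetrics` shape sketched above; the 10% efficiency bar for flat pass rates is illustrative, not a documented threshold:

```typescript
function canPromote(
  control: ExperimentMetrics,
  challenger: ExperimentMetrics,
  tier1Green: boolean,
  tier2Green: boolean,
): boolean {
  if (!tier1Green || !tier2Green) return false;              // no Tier 1 or Tier 2 regression
  if (!challenger.artifactsComplete) return false;           // artifacts must remain complete
  if (challenger.passRate < control.passRate) return false;  // target slice positive or neutral
  if (challenger.passRate === control.passRate) {
    // Flat pass rate: require a meaningful duration or token improvement.
    const faster = challenger.medianDurationMs <= control.medianDurationMs * 0.9;
    const cheaper = challenger.totalTokens <= control.totalTokens * 0.9;
    if (!faster && !cheaper) return false;
  }
  // Failure-mix shifts toward worse structural classes still need manual review.
  return true;
}
```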
Every failure must land in one bucket before work is prioritized.
Execution bug: the browser or driver does the wrong thing.
Examples:
- popup or new-tab not adopted
- stale selector handling broken
- dead click or type path
- auth state not applied
- incorrect browser storage/session setup
Action:
- fix the runtime or driver
Verifier bug: the agent did the work, but completion was rejected incorrectly.
Examples:
- first-party sibling-subdomain mismatch
- script-extracted evidence ignored
- a11y-only visibility bias
- correct answer rejected because the checker was too narrow
Action:
- make verification policy deterministic
Policy inefficiency: the agent is capable, but wastes turns.
Examples:
- repeated search reformulations
- excessive backtracking
- late completion after enough evidence exists
- repeated verifier bounce-back loops
- unnecessary navigation after landing on the right page
Action:
- add heuristics, recovery rules, or prompt changes only after structural causes are ruled out
Infrastructure instability: the system is correct but unstable under budget.
Examples:
- intermittent first-turn timeout
- provider latency spikes
- anti-bot variance
- noisy public-site dependencies
- rate-limit or quota instability
Action:
- instrument, isolate, and adjust budget or retry policy
Terminal blocker: the task should stop early rather than brute-force through a dead end.
Examples:
- captcha
- hard auth wall without credentials
- network unreachable
- domain constraints incompatible with the live site
Action:
- classify and stop early
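To keep classification honest, the five buckets can be a closed union that every run report must carry; the names mirror the class labels above, and the startup rule encodes the infrastructure-by-default principle:

```typescript
// Exactly one bucket per failure, in the order work is prioritized.
type FailureClass =
  | "execution-bug"              // the browser or driver did the wrong thing
  | "verifier-bug"               // work was done but completion was rejected
  | "policy-inefficiency"        // capable agent, wasted turns
  | "infrastructure-instability" // correct but unstable under budget
  | "terminal-blocker";          // captcha, auth wall, unreachable network

// Zero-turn and startup failures default to infrastructure until proven otherwise.
function classifyStartup(run: { turns: number; browserStarted: boolean }): FailureClass | null {
  if (run.turns === 0 || !run.browserStarted) return "infrastructure-instability";
  return null; // anything else needs case-by-case classification
}
```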
Every promotion-grade run must expose:
- first `navigate` timing
- first `observe` timing
- first `decide` timing
- first `execute` timing
- total turns
- repeated-query count
- verifier rejection count
- turns after first sufficient evidence
- final failure class
- report, manifest, and recording paths
Without this, optimization work is guesswork.
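A sketch of the per-run payload this list implies; field names are assumptions, and the intent is that every bullet above maps to one reportable field:

```typescript
interface RunInstrumentation {
  firstNavigateMs: number;   // time to first navigate
  firstObserveMs: number;    // time to first observe
  firstDecideMs: number;     // time to first decide
  firstExecuteMs: number;    // time to first execute
  totalTurns: number;
  repeatedQueryCount: number;                // same-query reformulations
  verifierRejectionCount: number;
  turnsAfterFirstSufficientEvidence: number; // direct waste measure
  finalFailureClass: string | null;          // one of the taxonomy buckets; null on pass
  artifactPaths: { report: string; manifest: string; recording: string };
}
```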
A change is not done when code compiles. It is done when all are true:
- the failure class is explicitly identified
- the fix is scoped to that class
- tests cover the regression where practical
- the same seeded slice is rerun
- results are attributable to the one change
- artifacts are preserved
- the promotion gate is passed or the change remains flagged
Goal:
- make results trustworthy
Checklist:
- stable seeded slice
- explicit per-run config capture
- artifact completeness checks
- reproducible memory isolation
- clear failure taxonomy output
- first-turn phase timing in reports
Exit criteria:
- repeated runs are comparable enough to support promotion decisions
Goal:
- remove runtime and verifier defects that create false failures
Checklist:
- popup and new-tab handling
- auth and storage-state correctness
- first-party host policy
- script-backed extraction policy
- terminal blocker fast-fail rules
- startup and zero-turn failure classification
Exit criteria:
- obvious false negatives and execution traps are gone
Goal:
- reduce wasted turns and budget burn
Checklist:
- search-result page heuristics
- cheap `scout` recommendations on ambiguous visible-link/result pages
- sufficient-evidence early completion (see the sketch below)
- bounded recovery for reformulation loops
- earlier extraction on search, catalog, and filter pages
- fewer redundant navigations after landing on good pages
- waste accounting in every report
Exit criteria:
- median turns and duration drop on the same slice without pass-rate loss
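The sufficient-evidence early-completion item lends itself to a concrete guard; a minimal sketch, assuming a hypothetical evidence model:

```typescript
// Complete as soon as collected evidence covers every required fact,
// instead of continuing to navigate "just in case".
interface Evidence { fact: string; sourceUrl: string; }

function shouldCompleteEarly(requiredFacts: string[], collected: Evidence[]): boolean {
  const covered = new Set(collected.map((e) => e.fact));
  return requiredFacts.every((fact) => covered.has(fact));
}

// Every turn taken after this first returns true is reportable waste.
```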
Goal:
- improve path choice without turning the codebase into a generic orchestration framework
Checklist:
- keep `actor` as the main loop
- add `scout` only as a narrow ambiguous-page recommendation pass
- keep `verifier` deterministic-first
- keep `supervisor` recovery-only
- keep branching and extra compute behind explicit challengers
Exit criteria:
- path quality improves on repeated slices
- architecture remains small, testable, and debuggable
Goal:
- compare strategies scientifically
Checklist:
- one baseline
- one challenger
- fixed seed, cases, and budget
- CI-aware comparison
- rollback path defined before promotion
Possible challengers:
- supervisor variants
- prompt variants
- routing variants
- memory variants
- bounded branch exploration variants
Exit criteria:
- the winner beats baseline with enough evidence to promote
Goal:
- make the winning path usable end to end
Checklist:
- app to worker to orchestrator to sandbox execution path
- auth files end to end
- artifact upload path
- live run visibility
- clean run reports and video playback
- CI setup path for users
Exit criteria:
- one real authenticated dogfood flow works end to end with artifacts
Goal:
- broaden coverage without losing rigor
Checklist:
- add WebVoyager
- expand WebBench slices
- add owned staging flows
- add optional wallet and crypto suites behind flags
Exit criteria:
- breadth increases without losing comparability discipline
Per-fix loop:
- classify the failure
- define the narrowest fix
- add or update tests
- rerun the same seeded slice
- compare against control
- promote, flag, or revert
Baseline loop:
- run Tier 1 and Tier 2 control baselines
- aggregate failures by class
- fix the highest-frequency structural issue first
- rerun baseline
Review loop:
- review pass rate, duration, turns, and cost trends
- review top failure classes
- decide the next single intervention
- retire dead-end experiments
The direction is correct:
- execution bugs were being misread as agent weakness
- verifier bugs were being misread as policy weakness
- benchmark integrity needed hardening before meaningful supervisor or prompt work
Recent wins that fit this model:
- popup and new-tab adoption
- first-party sibling-subdomain verification policy
- script-backed extraction acceptance
- `.env` and benchmark config integrity fixes
This is the canonical finish-line tracker. Work is done only when every item here is complete and verified.
| Track | Status | Done when | Verification |
|---|---|---|---|
| Tier 1 deterministic fixtures | Verified baseline | Stable at 100% on repeated local runs | npm run bench:tier1:gate |
| Tier 2 authenticated core flows | Verified baseline | Stable at 100% with real auth state and complete artifacts across repeated runs | npm run bench:tier2:repeat -- --storage-state ./.auth/ai-tangle-tools.json |
| Tier 3 public-web reach3 baseline | Verified baseline | At least 5 repeated seeded runs with no case below 80% pass and no structural false-positive class open | npm run bench:tier3:gate -- --existing-root ./agent-results/tier3-gate-visible-release-1772847117 |
| Search/domain policy correctness | Verified baseline | Disallowed-host clicks and false-positive completions are blocked deterministically | repeated NIH runs + targeted tests |
| Artifact completeness | Verified baseline | Every serious run emits report, manifest, and recording | artifact completeness checks in baseline/gate summaries |
| Cost and turn efficiency | Verified baseline | Per-turn/per-case/per-suite cost tracking via LiteLLM pricing DB; adaptive routing tested and tuned (verification-only on gpt-4.1-mini) | repeated baseline summaries + cost reports |
| Vision challenger | In progress | Vision-based policy must beat or match the baseline on repeated seeded runs before promotion | challenger-only repeated runs; not baseline |
| Product path readiness | Verified baseline | Winning execution path is wired cleanly into app -> worker -> orchestrator -> artifacts | verified local dogfood in abd-app: npm --prefix worker run e2e:real-ui |
This section is an operational snapshot, not roadmap authority. Keep policy and target architecture above stable; refresh or prune this section as the measured state changes.
Current honest status:
- Tier 1 deterministic control is green on the promoted local fixture set
- FULL WEBBENCH-50: 48/50 (96%) projected single-run, 98% reachable
- Previous: 45/50 (90%) chrome-channel → 42/50 (84%) patchright → 41/50 (82%) v3 → 36/50 (72%) v2 → 34/50 (68%) v1 → 19/50 (38%) non-stealth
- 0 task failures — every reachable site passes
- Failure breakdown: 1 anti-bot (Cambridge), 1 timeout (AliExpress)
- Fixes: ref resolution nth() for duplicate role+name, nav timeout 15s cap, observe load cap 10s, execute wall-clock 45s cap, filter strategy nudge, strengthened rules 22-24
- Results: `agent-results/track-1773009976576/` (v3 targeted), `agent-results/track-1773009076385/` (v2 targeted)
- Tier 2 repeated authenticated control is green across three valid repetitions
- `openai/gpt-5.4` remains the promoted default runtime
- `webbench-stealth` is the recommended profile for Tier 3 benchmarks
- key systemic fixes:
- stealth profile: headed mode + anti-detection flags + minimal resource blocking
- navigator property patching: plugins, languages, hardwareConcurrency, deviceMemory, chrome.runtime
- domain constraint relaxation: registrable domain matching (fixes subdomain redirects)
- progressive acceptance: Tier A (0.55 + evidence), Tier B (0.50 after 2), Tier C (0.40 after 3); see the sketch after this status list
- URL mapping fix in track script: `scenario.url` fallback saves 2 turns/run
- verifier: stronger SUPPLEMENTAL TOOL EVIDENCE trust, multi-page evidence acceptance
- escalation: 3-tier rejection feedback directing agent to use runScript for extraction
- mid-run extraction reminder when 50%+ turns used without evidence (strengthened with stop-navigating directive)
- filter strategy nudge at turn 8 for goals mentioning price/filter/sort
- ref resolution fix: nth() indexing for duplicate role+name elements (fixes Groupon price filter → search box misdirection)
- navigation timeout cap: 15s for page.goto, continues with partial DOM on timeout
- observe load state cap: 10s for waitForLoadState (prevents 30s+ stalls on heavy JS sites)
- execute wall-clock cap: 45s total including overlay recovery and retries
- iframe consent dialog dismissal via runScript (SourcePoint/OneTrust iframes)
- relaxed script-backed completion: accepts any URL in claim, broader verifier pattern matching
- evidence limit 3→5, content discovery rule, prioritized snapshot budgeting
- remaining failures (2, all infrastructure):
- anti-bot (1): Cambridge (Cloudflare TLS fingerprinting — server-side, unfixable without TLS library patches)
- timeout (1): AliExpress (page JS so heavy that 2 turns consume 180s — observe+execute compound)
- newly fixed (6 cases, 42→48):
- AllTrails, Crunchbase, Forbes: system Chrome `channel: 'chrome'` fixed TLS fingerprint detection
- JustDial: strengthened rule 22 (extract-before-navigate) + listing extraction nudge
- Groupon: ref resolution nth() fix (price filter was resolving to search box) + filter strategy nudge
- Goal.com: improved rule 20 (section navigation) + content discovery
- `scout` remains challenger-only; not promoted
- cost tracking: LiteLLM-backed pricing (2,200+ models), 24h disk cache, tiered context pricing
- per-turn cost in debug CLI output, per-case and per-suite totals in reports and track summaries
- gpt-5.4 tiered: $2.50/$15.00 base, $5.00/$22.50 above 272K context (agent runs ~30K, always base)
- fallback table covers gpt-5.4/5.2/5.1/5.3-codex/4.1/4.1-mini/4.1-nano/4o/4o-mini + Claude Opus/Sonnet/Haiku 4.5/4.6
- adaptive model routing experiment results (2026-03-08):
- v1 (nav model for early 30% of turns): 2/3 pass, $0.72 — worse decisions cascade into more turns
- v2 (nav model for turns 1-2 only): 3/3 pass, $1.14 — bad first turns compound into longer runs
- v3 (verification-only on gpt-4.1-mini): 3/3 pass, $0.48 — matches no-routing baseline ($0.49)
- finding: gpt-5.4 is more cost-effective than routing because it completes in fewer turns
- shipped: verification calls route to gpt-4.1-mini; decide() stays on primary model
- history compression experiment (2026-03-08):
- aggressive one-line summaries: COUNTERPRODUCTIVE — agent loses visited-page context, 2x turns
- current compactHistory() (strip ELEMENTS, keep full text) is empirically optimal
- warm memory experiment (2026-03-08, clean A/B with fixed compression):
- baseline (no memory): 3/3 pass, 20 turns total, $0.51
- cold memory: 3/3 pass, 22 turns, $0.56 (+10% overhead from trajectory storage)
- warm memory: 3/3 pass, 16 turns, $0.49 (Yale 7→5, NIH 11→9 turns)
- trajectory matching is the main value: stores successful run paths for reuse
- recommend: enable `--memory` for repeated benchmarks, keep `--memory-isolation per-run` for A/B
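The progressive-acceptance fix above can be read as a threshold schedule that relaxes with verification attempts; a sketch under that reading (the function names and the attempt-count interpretation are assumptions):

```typescript
// Tier A: 0.55 confidence plus supporting evidence on the first attempt.
// Tier B: 0.50 after two attempts. Tier C: 0.40 after three.
function acceptanceBar(attempt: number): { minConfidence: number; requireEvidence: boolean } {
  if (attempt <= 1) return { minConfidence: 0.55, requireEvidence: true };   // Tier A
  if (attempt === 2) return { minConfidence: 0.50, requireEvidence: false }; // Tier B
  return { minConfidence: 0.40, requireEvidence: false };                    // Tier C
}

function acceptCompletion(confidence: number, hasEvidence: boolean, attempt: number): boolean {
  const bar = acceptanceBar(attempt);
  return confidence >= bar.minConfidence && (hasEvidence || !bar.requireEvidence);
}
```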
Current best evidence:
- Tier 1 deterministic summary: `./agent-results/tier1-green-1772794410/tier1-gate-summary.json`
- Tier 1 deterministic markdown: `./agent-results/tier1-green-1772794410/tier1-gate-summary.md`
- clean corrected `reach3`: `./agent-results/reach3-contenthub-v4-1772786683/track-summary.json`
- current promoted repeated `reach3`: `./agent-results/tier3-gate-visible-release-1772847117/`
- current promoted Tier 3 summary: `./agent-results/tier3-gate-visible-release-1772847117/tier3-gate-summary.json`
- current promoted Tier 3 markdown: `./agent-results/tier3-gate-visible-release-1772847117/tier3-gate-summary.md`
- Tier 2 repeated authenticated summary: `./agent-results/tier2-repeat-green-1772792440/tier2-repeat-summary.json`
- Tier 2 repeated authenticated markdown: `./agent-results/tier2-repeat-green-1772792440/tier2-repeat-summary.md`
- Tier 2 post-fix template verification summary: `./agent-results/tier2-repeat-post-template-fix-1772794740/tier2-repeat-summary.json`
- NIH post-fix focused summary: `./agent-results/nih-token-fix-repeat-1772795250/tier3-gate-summary.json`
- local product-path evidence lives in `abd-app`: `/tmp/abd-real-ui-e2e-gate-1772826110/`
- provider screen, OpenAI control: `./agent-results/provider-openai-gpt54-reach3-1772840341/track-summary.json`
- provider screen, Codex challenger: `./agent-results/provider-codex-gpt54-reach3-1772840486/track-summary.json`
- focused NIH correctness + cost pass: `./agent-results/nih-visible-release-1772847031/baseline-summary.json`
- older guarded `reach3`: `./agent-results/reach3-content-guard-v2-1772842574/track-summary.json`
- wider sanity slice on the current baseline: `./agent-results/webbench-sanity6-1772848273/track-summary.json`
- anti-bot reach challenger on Crunchyroll: `./agent-results/crunchyroll-webbench-stealth-1772849365/report.json`
- anti-bot reach challenger on APKPure: `./agent-results/apkpure-webbench-stealth-1772849425/report.json`
- top-2 branch challenger NIH smoke: `./agent-results/nih-top2-branch-smoke-1772843605/baseline-summary.json`
- top-2 branch challenger `reach3`: `./agent-results/reach3-top2-branch-1772843662/track-summary.json`
- Tier 2 validated repetition summaries:
  - `./agent-results/tier2-repeat-green-1772792440/rep-1/tier2-gate-summary.json`
  - `./agent-results/tier2-repeat-green-1772792440/rep-2/tier2-gate-summary.json`
  - `./agent-results/tier2-repeat-green-1772792440/rep-3/tier2-gate-summary.json`
- current promoted repeated control medians:
  - Yale (`webbench-2204`): 5/5, median 19.4s, median 4 turns, median 18.5k tokens
  - NIH (`webbench-2605`): 5/5, median 57.0s, median 11 turns, median 153.1k tokens
  - Alberta (`webbench-32`): 5/5, median 37.5s, median 7 turns, median 54.5k tokens
- latest provider screen on the guarded `reach3` slice:
  - OpenAI `gpt-5.4`: Yale pass 19.1s / 4 turns / 18.6k; NIH pass 53.8s / 11 turns / 133.9k; Alberta pass 47.2s / 8 turns / 74.0k
  - Codex CLI `gpt-5.4`: Yale pass 45.5s / 4 turns / 51.4k; NIH fail 120.0s / 9 turns / 178.5k; Alberta pass 93.4s / 7 turns / 113.4k
- latest honest guarded baseline:
  - Yale (`webbench-2204`): pass 22.8s / 4 turns / 18.6k
  - NIH (`webbench-2605`): pass 66.9s / 12 turns / 153.4k on `https://www.nih.gov/news-events/news-releases/...`
  - Alberta (`webbench-32`): pass 49.6s / 8 turns / 74.1k
- current promoted repeated baseline:
  - Yale (`webbench-2204`): 5/5, median 19.4s / 4 turns / 18.5k
  - NIH (`webbench-2605`): 5/5, median 57.0s / 11 turns / 153.1k
  - Alberta (`webbench-32`): 5/5, median 37.5s / 7 turns / 54.5k
- top-2 branch challenger:
  - focused NIH smoke: pass 41.8s / 9 turns / 83.2k
  - full `reach3`: Yale improved, Alberta improved, NIH regressed to timeout; do not promote
- cookie-fix verification (post-fix baseline):
  - Tier 1 gate: PASS (100%)
  - reach3 regression check: Yale pass 18.2s / 4 turns / 16.4k; NIH pass 65.5s / 13 turns / 184.8k; Alberta pass 36.3s / 7 turns / 47.4k
  - John Lewis focused: pass 53.7s / 4 turns / 32.9k
- stealth reach5 baseline (`benchmark-webbench-stealth` + cookie fix):
  - Crunchyroll: pass 14.1s / 3 turns / 11.5k
  - APKPure: fail (timeout, search-field a11y issue)
  - John Lewis: pass 49.3s / 4 turns / 33.0k
  - Target: fail (timeout, path inefficiency)
  - Best Buy: pass 96.2s / 9 turns / 237.3k
- stealth reach5 v3 (oscillation fix + snapshot budget + action timeout):
  - Crunchyroll: pass 18.4s / 3 turns / 13.5k
  - APKPure: pass 92.1s / 9 turns / 243.9k
  - John Lewis: pass 42.8s / 4 turns / 26.8k
  - Target: pass 78.5s / 5 turns / 37.0k
  - Best Buy: pass 26.7s / 3 turns / 101.7k
  - result: 5/5 (100%), up from 3/5 (60%)
- reach3 with search auto-submit + verification escalation (2 reps):
  - rep1: Yale pass 4 turns / 26k; NIH pass 13 turns / 217k; Alberta pass 11 turns / 151k
  - rep2: Yale pass 15 turns / 191k; NIH pass 9 turns / 120k; Alberta pass 9 turns / 98k
  - result: 3/3 (100%) across both reps; NIH was previously ~62% (5/8), now 100% (4/4 counting both NIH-only and full runs)
- reach4 with expert-level improvements (budget pressure + extraction guard + same-page snapshot):
  - Yale: pass 5 turns / 36k
  - NIH: pass 11 turns / 170k
  - Alberta: pass 8 turns / 76k
  - Encyclopedia.com: pass 6 turns / 61k (was 25-turn timeout / 564k)
  - result: 4/4 (100%); Encyclopedia.com unlocked by the extraction guard
Exit rule:
- do not call the browser agent production-ready until Tier 1 is green, Tier 2 is green, and repeated Tier 3 control runs are stable enough to support promotion decisions
P0:
- keep the guarded non-vision path as baseline until a challenger beats it cleanly
- keep `openai/gpt-5.4` as the promoted default runtime; do not switch the baseline to `codex-cli` unless it wins on the fixed slice
- verify the Tier 2 template-verification cost fix continues to hold in CI/nightly, then allow `fast-explore` to remain first-class on authenticated flows
- keep product-path readiness green in `abd-app` CI and hosted gates
- preserve the new content-type guard; do not accept public-web “wins” that land on the wrong content class
P1:
- reduce Tier 3 cost variance, especially NIH, on the promoted slice (resolved: search auto-submit + verification escalation brought NIH from ~62% to 100% across 4 consecutive runs)
- reduce wasted-turn variance on Yale and Alberta after NIH is stable
- run repeated seeded stealth reach5 experiments to build promotion-grade evidence for the reach challenger
- fix APKPure search-field a11y detection (resolved: action timeout scaling, 15s for 120s cases, prevents stuck clicks from consuming the entire budget)
- fix Target path inefficiency (resolved: oscillating stuck detection for A-B-A-B patterns breaks menu open/close loops)
- reduce Best Buy token cost (resolved: snapshot budget cap, 16k chars with interactive-first filtering, reduced 237k → 102k tokens)
- stabilize stealth reach5 at 100% across repeated seeded runs before promotion
- raise Tier 2 authenticated coverage with the same artifact standards
- reduce `fast-explore` cost and turn variance on authenticated template verification before considering it a Tier 2 default
- improve the guarded search/content path before promoting any new subagent policy
- use the top-2 branch challenger only as a measured experiment until it beats the guarded baseline on repeated seeded runs
P2:
- resume supervisor and policy challenger experiments only after the slice is stable
- keep vision as a challenger until it shows repeated non-regressive gains
- continue `scout` only as a challenger until it beats the guarded baseline on repeated seeded runs
Candidate experiments to queue after the current slice is stable:
- bounded branch exploration at high-ambiguity points only
- branch count capped at 2 to 3
- short horizon only (1 to 3 actions per branch)
- use read-mostly scouting before side-effectful actions
- score and prune aggressively; continue only the winning branch
- never make this default until it proves non-regressive on cost and pass rate
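A sketch of that bounded exploration loop; the probe signature and scoring are assumptions, with read-mostly scouting approximated by a side-effect-free scoring callback:

```typescript
interface Branch { actions: string[]; score: number; }

// Explore at most `maxBranches` alternatives, each at most `horizon` actions deep,
// using a read-only probe; keep only the single best-scoring branch.
async function exploreBounded(
  candidates: string[][],                        // proposed action sequences at an ambiguity point
  probe: (actions: string[]) => Promise<number>, // read-only scoring pass, no side effects
  maxBranches = 2,
  horizon = 3,
): Promise<string[] | null> {
  const branches: Branch[] = [];
  for (const actions of candidates.slice(0, maxBranches)) {
    const bounded = actions.slice(0, horizon);   // short horizon only
    branches.push({ actions: bounded, score: await probe(bounded) });
  }
  branches.sort((a, b) => b.score - a.score);    // prune aggressively: winner takes all
  return branches.length > 0 ? branches[0].actions : null;
}
```

Keeping the probe read-only is what makes it safe to run before any side-effectful action is committed.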
Do:
- keep the current slice small until it is trustworthy
- bias toward deterministic fixes
- preserve artifacts for every serious run
- use repeated seeded experiments for decisions
Do not:
- widen scope because one case passed once
- run many promotion-grade experiments in parallel
- mix product features with benchmark research in one change
- use open-web noise to excuse Tier 1 or Tier 2 regressions
- promote on narrative instead of evidence
This is the next sequence to execute:
- implement phase timing and waste accounting (done: per-turn/per-case/per-suite cost tracking shipped)
- rerun the same `reach3` slice for repeated baselines (done: reach3 stable at 100%)
- identify the top remaining failure class (done: 48/50; remaining are anti-bot (Cambridge) and page-weight timeout (AliExpress))
- run full 50-case validation to confirm the 48/50 projected score
- investigate AliExpress — mobile site redirect, lighter initial page, or longer timeout
- expand benchmark breadth: WebVoyager slice, additional staging flows
That is the fastest path to a meaningfully better browser agent.