feat(tau2/vikingbot): benchmark updates #2244
Adds benchmark/tau2/vikingbot/, an end-to-end harness that runs the full
VikingBot AgentLoop on tau2-bench tasks and commits trajectories back into
OpenViking memory for epoch-based self-improvement. This complements the
existing memory-retrieval harness in benchmark/tau2/ (which is retrieval-only).
Contents:
- scripts/vikingbot_tau2_runner.py: run one tau2 task through the agent loop
(tau2 tool registry swap, simulated-time patch, advisory memory scope guard).
- scripts/run_tau2_domain.sh / run_eval_reward.sh: run a domain split with
bounded concurrency and score average reward.
- scripts/commit_trajectory_to_memory.py: commit train trajectories to memory.
- scripts/stat_trajectory.py, check_openviking_tool_calls.py: analysis helpers.
- tau2_env/: tau2 environment + tool-provider integration.
- run_full_test.sh and run_{airline,retail}_*epochs.sh: full / multi-epoch runs.
- setup_env.sh, README.md, .gitignore.
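The "bounded concurrency and score average reward" step can be pictured with a minimal Python sketch. The actual scripts are shell; `run_split`, `fake_run_task`, and the concurrency limit below are hypothetical names chosen for illustration, not code from this PR.

```python
import asyncio
from statistics import mean
from typing import Awaitable, Callable, Sequence


async def run_split(task_ids: Sequence[str],
                    run_task: Callable[[str], Awaitable[float]],
                    max_concurrency: int = 4) -> float:
    """Run all tasks with at most `max_concurrency` in flight; return the mean reward."""
    if not task_ids:
        raise ValueError("no tasks to score")
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(task_id: str) -> float:
        # The semaphore caps how many tasks run at the same time.
        async with sem:
            return await run_task(task_id)

    rewards = await asyncio.gather(*(guarded(t) for t in task_ids))
    return mean(rewards)


async def fake_run_task(task_id: str) -> float:
    # Hypothetical stand-in: a real runner would execute one tau2 task
    # and return its reward.
    await asyncio.sleep(0)
    return 1.0 if task_id in {"t0", "t2"} else 0.0
```

Calling `asyncio.run(run_split(["t0", "t1", "t2", "t3"], fake_run_task, max_concurrency=2))` averages the four per-task rewards.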
tau2-bench is referenced as an external dependency (cloned + installed by the
user); no OpenViking core changes are required. The runner is API-compatible
with bot/vikingbot on current main.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror the two evaluation approaches as sibling subfolders under benchmark/tau2/:
- llm/: the existing OpenViking Memory V2 retrieval harness, moved from benchmark/tau2/. All internal benchmark/tau2/... path references and the REPO_ROOT depth computations (run_full_eval.sh, tau2_common.py, run_memory_v2_eval.py) are updated for the extra directory level.
- vikingbot/: the VikingBot agent runner (added in the previous commit).
vikingbot/ cleanup:
- make memory-block extraction time-independent: anchor on the stable session header and trailing reply instruction instead of a fixed simulated timestamp (the sim-time patch was removed, so the current time is now system-generated).
- drop the now-removed sim-time / scope-guard notes from the README.
- remove the unused stat_trajectory.py and check_openviking_tool_calls.py helpers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
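One plausible reading of the time-independent memory-block extraction described in this commit is sketched below. The anchor strings `SESSION_HEADER` and `REPLY_INSTRUCTION` are hypothetical placeholders; the real prompt markers are not shown in this PR.

```python
from typing import Optional

# Hypothetical anchors: the actual header and instruction text are assumptions.
SESSION_HEADER = "## Session"
REPLY_INSTRUCTION = "Reply to the user"


def extract_memory_block(prompt: str,
                         header: str = SESSION_HEADER,
                         trailer: str = REPLY_INSTRUCTION) -> Optional[str]:
    """Return the text between the first `header` and the last `trailer`.

    Nothing here depends on a timestamp, so a system-generated current
    time elsewhere in the prompt does not affect the result.
    """
    start = prompt.find(header)
    if start == -1:
        return None
    start += len(header)
    end = prompt.rfind(trailer)
    if end == -1 or end < start:
        return None
    return prompt[start:end].strip()
```

Two prompts that differ only in their "current time" line yield the same block, which is the property the commit is after.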
…lagents, doc updates
- run_full_test.sh: run train once per epoch (experience extraction) and test N times in parallel (--test-repeats, default 8), reporting the averaged test accuracy; keep --commit/--no-commit.
- tau2_environment.py: remove the unused smolagents Tool path (CommunicateWithUser / create_tool_from_json_schema / self.tools); communicate_with_user is handled directly in tool_call. tau2-bench has no smolagents dependency, so it is dropped.
- README: reorder install (tau2-bench first so setup_env can derive TAU2_DATA_ROOT), explain train-once/test-8x methodology and train-only memory extraction, document the required bot/vikingbot core changes (agent_id isolation + agent-experience memory), fix sibling links to ../llm/.
- Remove run_retail_3epochs.sh.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
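The train-once / test-N-times methodology can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of run_full_test.sh: `run_epoch`, `run_train`, and `run_test` are hypothetical callables standing in for the shell steps.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List


def run_epoch(run_train: Callable[[], None],
              run_test: Callable[[int], float],
              test_repeats: int = 8) -> float:
    """Run train once, then `test_repeats` test passes in parallel.

    Returns the test accuracy averaged over the repeats.
    """
    if test_repeats < 1:
        raise ValueError("test_repeats must be at least 1")
    run_train()  # single train pass (experience extraction)
    with ThreadPoolExecutor(max_workers=test_repeats) as pool:
        accuracies: List[float] = list(pool.map(run_test, range(test_repeats)))
    return mean(accuracies)
```

Averaging over repeated test runs reduces the run-to-run variance of a single pass, while the train split is still executed only once per epoch.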
…efactor
setup_env.sh now does full environment setup in a single `source`: creates a fresh repo-root .venv, clones tau2-bench (external dep), installs openviking + vikingbot (pip install -e ., runs the Cargo build) + tau2-bench + smolagents, then activates and exports the runtime env vars. Idempotent via a marker file; supports --reinstall. README updated to document the one-step flow and the overridable env vars.
Also move the communicate_with_user tool into a CommunicateWithUser class in tau2_environment.py (owns both schema and execution) and drop the duplicated inline schema from tau2_tool_provider.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
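The marker-file idempotency mentioned for setup_env.sh follows a common pattern, sketched here in Python rather than shell. The function name, the marker path, and the `reinstall` flag are illustrative assumptions; the script's actual marker location is not given in this PR.

```python
from pathlib import Path
from typing import Callable


def ensure_setup(marker: Path,
                 install: Callable[[], None],
                 reinstall: bool = False) -> bool:
    """Run `install` unless `marker` already exists.

    Returns True if the install step ran, False if it was skipped.
    Passing reinstall=True forces the install even when the marker exists.
    """
    if marker.exists() and not reinstall:
        return False
    install()
    # Write the marker only after a successful install, so a failed
    # setup is retried on the next invocation.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("ok\n")
    return True
```

Writing the marker last is the important design choice: if the install raises, no marker is left behind and the next run starts over.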
…ifications
Backport the environment-setup fixes and README clarifications discovered while running the harness end-to-end (the core bot/vikingbot code changes live on the test/tau2-vikingbot-core-changes branch, not here):
- setup_env.sh: install the [bot] extra (prompt_toolkit/gradio/mcp/...), build + bundle ragfs_python via maturin when the editable install skips it under pip build isolation, and install tau2-bench with the [gym] extra (gymnasium)
- README.md: explain the server port (default 1933 vs bot.ov_server.server_url) and show the None-safe forms of the Change-1 diffs
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR Code Suggestions ✨
No code suggestions found for the PR.
PR title
feat(benchmark/tau2): add VikingBot agent runner and split harness into llm/ + vikingbot/
PR body
Summary
This PR does two things under benchmark/tau2/:
- Adds benchmark/tau2/vikingbot/: a harness that runs the full VikingBot agent (bot/vikingbot AgentLoop) end-to-end on tau2-bench tasks, then commits the resulting trajectories back into OpenViking memory so the agent can self-improve across epochs (cold start → memory-augmented runs). Memory is extracted only from the train split; the test split is held out to measure the improvement once that memory is injected (no test-set leakage).
- Splits benchmark/tau2/ into two sibling subfolders so the two evaluation approaches are cleanly separated.
Repository layout
- llm/ is the previous top-level benchmark/tau2/ content (README.md, config/, scripts/, run_full_eval.sh, .gitignore), moved via git mv so history is preserved. Only path plumbing changed for the extra directory level: internal benchmark/tau2/... references were repointed to benchmark/tau2/llm/..., and the REPO_ROOT depth computations in run_full_eval.sh, scripts/tau2_common.py, and scripts/run_memory_v2_eval.py were updated. No eval logic changed.
- vikingbot/ is new.
The two harnesses are complementary, not duplicates. Both are multi-turn and exercise OpenViking memory extraction + retrieval; they differ in which agent drives the tasks:
- benchmark/tau2/llm/ (moved): tau2-bench's native ReAct agent, wired to OpenViking memory, to measure the effect of that memory on task performance.
- benchmark/tau2/vikingbot/ (new): an end-to-end, self-improving agent eval that runs the full VikingBot agent loop and commits trajectories back into memory so the agent improves across epochs.
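The REPO_ROOT depth adjustment for the llm/ move can be illustrated with a small sketch. It assumes the scripts derive the repository root by walking up from their own path, which this PR does not show; `repo_root_for` and the `/repo` prefix are hypothetical.

```python
from pathlib import PurePosixPath


def repo_root_for(script_path: str, depth: int) -> PurePosixPath:
    """Return the ancestor `depth` levels above the script's own directory.

    parents[0] is the directory containing the script, parents[1] its
    parent, and so on.
    """
    return PurePosixPath(script_path).parents[depth]


# Before the move the script sat three levels below the repo root.
old_path = "/repo/benchmark/tau2/scripts/tau2_common.py"
# After the move there is one extra directory level (llm/).
new_path = "/repo/benchmark/tau2/llm/scripts/tau2_common.py"

old_root = repo_root_for(old_path, 3)
new_root = repo_root_for(new_path, 4)
```

Keeping the old depth after the move would resolve to `/repo/benchmark` instead of the repository root, which is the kind of breakage the path plumbing change avoids.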
What's included (vikingbot/)
- scripts/vikingbot_tau2_runner.py: run one tau2 task through the agent loop.
- scripts/run_tau2_domain.sh, scripts/run_eval_reward.sh: run a domain split with bounded concurrency and score average reward.
- scripts/commit_trajectory_to_memory.py: commit train trajectories into OpenViking memory.
- tau2_env/: tau2 environment + tool-provider integration.
- run_full_test.sh: one epoch (1 train run + 8 test runs in parallel; --test-repeats, default 8), with test accuracy averaged over the repeats.
- run_airline_2epochs.sh: multi-epoch examples.
- setup_env.sh, README.md, .gitignore.
Notes
- …commit live entirely in the runner. The runner uses existing bot/vikingbot APIs on current main (AgentLoop(eval=...), _run_agent_loop, context.build_messages(..., memory_users=...), cli.commands._init_bot_data/_make_provider, utils.helpers.get_source_workspace_path).
- bot/vikingbot core changes are required to reproduce: per-domain workspace isolation via agent_id, and reading only v2 agent memory (not user memory) at system-prompt build time. These are documented with diffs in vikingbot/README.md but are not included as code in this PR; apply them to your bot/vikingbot checkout.
- The llm/ move changes only path references, not behavior. A repo-wide check found no code outside benchmark/tau2/ that references the moved paths.
- …packages beyond it). The user-simulator uses an OpenAI-compatible endpoint (OPENAI_API_KEY / OPENAI_API_BASE).
- …result*/, trajectory*/), the external tau2-bench/ checkout, logs and reports are git-ignored.
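Since the user-simulator depends on OPENAI_API_KEY and OPENAI_API_BASE, a pre-flight check along these lines can catch a missing variable before a long run starts. This is a sketch that assumes both variables must be set; `simulator_endpoint` is a hypothetical helper, not part of the harness.

```python
import os
from typing import Dict, Mapping


def simulator_endpoint(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the OpenAI-compatible endpoint settings from the environment.

    Raises RuntimeError naming every missing variable, so a run fails
    fast instead of partway through a domain split.
    """
    required = ("OPENAI_API_KEY", "OPENAI_API_BASE")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["OPENAI_API_KEY"],
        # Normalize a trailing slash so later URL joins are predictable.
        "base_url": env["OPENAI_API_BASE"].rstrip("/"),
    }
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching the real environment.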
Testing
- bash -n passes on all shell scripts (both llm/ and vikingbot/).
- python3 -m py_compile passes on all Python modules.
- llm/ reorg verified: every benchmark/tau2/... reference repointed to benchmark/tau2/llm/..., REPO_ROOT depth computations fixed, and no external references to the moved paths remain.
- bot/vikingbot API surface verified against current main by source inspection.
- …tau2-bench, the two documented bot/vikingbot core changes applied, and a running OpenViking server started with --with-bot.
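The py_compile check listed above can be reproduced over a whole directory tree with the standard-library `py_compile` module. The sketch below is one way to do it; `find_syntax_errors` is a hypothetical helper name, and the root directory to scan is up to the caller.

```python
import py_compile
from pathlib import Path
from typing import List


def find_syntax_errors(root: Path) -> List[str]:
    """Byte-compile every .py file under `root` and collect failures.

    Returns one "path: message" string per file that does not compile;
    an empty list means every module passed.
    """
    failures: List[str] = []
    for path in sorted(root.rglob("*.py")):
        try:
            # doraise=True turns compile errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append(f"{path}: {exc.msg}")
    return failures
```

Like `bash -n`, this only checks syntax; it does not import the modules or exercise the harness end-to-end.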