feat(tau2/vikingbot): benchmark updates by nicoleqiwt · Pull Request #2244 · volcengine/OpenViking

nicoleqiwt · 2026-05-26T10:34:45Z

PR title

feat(benchmark/tau2): add VikingBot agent runner and split harness into llm/ + vikingbot/

PR body

Summary

This PR does two things under benchmark/tau2/:

Adds benchmark/tau2/vikingbot/, a harness that runs the full VikingBot agent
(bot/vikingbot AgentLoop) end-to-end on tau2-bench tasks, then commits the resulting
trajectories back into OpenViking memory so the agent can self-improve across epochs
(cold start → memory-augmented runs). Memory is extracted only from the train split; the
test split is held out to measure the improvement once that memory is injected (no test-set
leakage).
Reorganizes benchmark/tau2/ into two sibling subfolders so the two evaluation
approaches are cleanly separated.
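
The train/test protocol above can be sketched as a small driver loop. This is only an illustration under assumptions: `run_split` and `commit_to_memory` are hypothetical callables standing in for the harness scripts, and the ordering inside an epoch (score the test split, then commit train trajectories for the next epoch) is assumed rather than taken from the PR.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins for the harness scripts; not functions from this PR.
RunSplit = Callable[[str, int], List[dict]]  # (split, epoch) -> trajectories
Commit = Callable[[Sequence[dict]], None]    # writes trajectories to memory


def run_epochs(n_epochs: int, run_split: RunSplit, commit_to_memory: Commit) -> List[float]:
    """Return the mean test reward per epoch; only train trajectories are committed."""
    scores: List[float] = []
    for epoch in range(n_epochs):
        train = run_split("train", epoch)
        test = run_split("test", epoch)
        rewards = [t["reward"] for t in test]
        scores.append(sum(rewards) / len(rewards) if rewards else 0.0)
        # Test trajectories are never committed, so the test split stays held out.
        commit_to_memory(train)
    return scores
```

With this ordering, epoch 0 measures the cold-start agent and later epochs measure the memory-augmented agent.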

Repository layout

benchmark/tau2/
├── llm/        # existing OpenViking Memory V2 retrieval harness (moved here, behavior unchanged)
└── vikingbot/  # new: full VikingBot agent runner

llm/ is the previous top-level benchmark/tau2/ content (README.md, config/, scripts/,
run_full_eval.sh, .gitignore), moved via git mv so history is preserved. Only path
plumbing changed for the extra directory level: internal benchmark/tau2/... references were
repointed to benchmark/tau2/llm/..., and the REPO_ROOT depth computations in
run_full_eval.sh, scripts/tau2_common.py, and scripts/run_memory_v2_eval.py were updated.
No eval logic changed.
vikingbot/ is new.
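
To illustrate why the REPO_ROOT depth computations had to change, here is a minimal sketch of resolving the repository root from a script path. The helper `repo_root_from` and the `/repo` prefix are hypothetical; the actual scripts may compute the root differently.

```python
from pathlib import PurePosixPath


def repo_root_from(script_path: str, levels_up: int) -> PurePosixPath:
    """Return the directory `levels_up` levels above the script's own directory."""
    return PurePosixPath(script_path).parents[levels_up]


# Before the move, the script sat three levels below the repo root.
old_root = repo_root_from("/repo/benchmark/tau2/scripts/tau2_common.py", 3)
# After the move into llm/, the same script needs one extra level.
new_root = repo_root_from("/repo/benchmark/tau2/llm/scripts/tau2_common.py", 4)
```

Both calls resolve to the same root, which is the invariant the path plumbing has to preserve.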

The two harnesses are complementary, not duplicates. Both are multi-turn and exercise OpenViking
memory extraction + retrieval; they differ in which agent drives the tasks:

benchmark/tau2/llm/ (moved) — tau2-bench's native ReAct agent, wired to OpenViking memory,
to measure the effect of that memory on task performance.
benchmark/tau2/vikingbot/ (new) — an end-to-end, self-improving agent eval: runs the full
VikingBot agent loop and commits trajectories back into memory so the agent improves across epochs.

What's included (`vikingbot/`)

scripts/vikingbot_tau2_runner.py — run one tau2 task through the agent loop.
scripts/run_tau2_domain.sh, scripts/run_eval_reward.sh — run a domain split with bounded
concurrency and score average reward.
scripts/commit_trajectory_to_memory.py — commit train trajectories into OpenViking memory.
tau2_env/ — tau2 environment + tool-provider integration.
run_full_test.sh — one epoch: 1 train run + 8 test runs in parallel (--test-repeats, default
8), with test accuracy averaged over the repeats. run_airline_2epochs.sh
— multi-epoch examples.
setup_env.sh, README.md, .gitignore.
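
The "bounded concurrency, then average reward" idea can be sketched in Python as follows. This is not the shell scripts' implementation: `run_task` is a hypothetical callable that runs one task and returns its reward, and a thread pool is just one way to cap the number of tasks in flight.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Sequence


def average_reward(
    task_nos: Sequence[int],
    run_task: Callable[[int], float],  # hypothetical: runs one task, returns its reward
    max_workers: int = 4,
) -> float:
    """Run tasks with at most `max_workers` in flight and return the mean reward."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rewards = list(pool.map(run_task, task_nos))
    return mean(rewards)


def averaged_test_accuracy(per_repeat_scores: Sequence[float]) -> float:
    """Average the per-repeat scores, as done for the repeated test runs."""
    return mean(per_repeat_scores)
```

Note that `statistics.mean` raises on an empty sequence, so a caller would need to guard against an empty split.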

Notes

Runner-level adaptations (the tau2 tool registry and the epoch-based memory commit)
live entirely in the runner. The runner uses existing
bot/vikingbot APIs on current main (AgentLoop(eval=...), _run_agent_loop,
context.build_messages(..., memory_users=...), cli.commands._init_bot_data/_make_provider,
utils.helpers.get_source_workspace_path).
Two bot/vikingbot core changes are required to reproduce: per-domain workspace isolation
via agent_id, and reading only v2 agent memory (not user memory) at system-prompt build
time. These are documented with diffs in vikingbot/README.md but are not included as code
in this PR — apply them to your bot/vikingbot checkout.
The llm/ move changes only path references, not behavior. A repo-wide check found no code
outside benchmark/tau2/ that references the moved paths.
tau2-bench is an external dependency, cloned and installed by the user (not vendored, no extra
packages beyond it). The user-simulator uses an
OpenAI-compatible endpoint (OPENAI_API_KEY / OPENAI_API_BASE).
Generated artifacts (result*/, trajectory*/), the external tau2-bench/ checkout, logs and
reports are git-ignored.
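
Since the user simulator depends on OPENAI_API_KEY and OPENAI_API_BASE, a small preflight check can catch a missing variable before a long run starts. The helper below is a hypothetical sketch and is not part of the PR.

```python
import os
from typing import List, Mapping

# Variable names taken from the note above.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_BASE")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A caller could abort early when the returned list is non-empty.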

Testing

bash -n passes on all shell scripts (both llm/ and vikingbot/).
python3 -m py_compile passes on all Python modules.
llm/ reorg verified: every benchmark/tau2/... reference repointed to benchmark/tau2/llm/...,
REPO_ROOT depth computations fixed, and no external references to the moved paths remain.
Required bot/vikingbot API surface verified against current main by source inspection.
Full runtime requires the external dependency tau2-bench, the two documented bot/vikingbot
core changes applied, and a running OpenViking server started with --with-bot.
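
The py_compile check can also be run over a whole directory tree from Python. The sketch below uses the standard-library `py_compile` module; the function name `find_compile_errors` is illustrative, not something defined in the PR.

```python
import py_compile
from pathlib import Path
from typing import List


def find_compile_errors(root: str) -> List[str]:
    """Byte-compile every .py file under `root`; return messages for files that fail."""
    errors: List[str] = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # doraise=True turns syntax errors into PyCompileError instead of printing them.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors.append(f"{path}: {exc.msg}")
    return errors
```

Like `python3 -m py_compile`, this only checks that the files compile; it does not import or execute them.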

Adds benchmark/tau2/vikingbot/, an end-to-end harness that runs the full VikingBot AgentLoop on tau2-bench tasks and commits trajectories back into OpenViking memory for epoch-based self-improvement. This complements the existing memory-retrieval harness in benchmark/tau2/ (which is retrieval-only).

Contents:
- scripts/vikingbot_tau2_runner.py: run one tau2 task through the agent loop (tau2 tool registry swap, simulated-time patch, advisory memory scope guard).
- scripts/run_tau2_domain.sh / run_eval_reward.sh: run a domain split with bounded concurrency and score average reward.
- scripts/commit_trajectory_to_memory.py: commit train trajectories to memory.
- scripts/stat_trajectory.py, check_openviking_tool_calls.py: analysis helpers.
- tau2_env/: tau2 environment + tool-provider integration.
- run_full_test.sh and run_{airline,retail}_*epochs.sh: full / multi-epoch runs.
- setup_env.sh, README.md, .gitignore.

tau2-bench is referenced as an external dependency (cloned + installed by the user); no OpenViking core changes are required. The runner is API-compatible with bot/vikingbot on current main.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirror the two evaluation approaches as sibling subfolders under benchmark/tau2/:
- llm/: the existing OpenViking Memory V2 retrieval harness, moved from benchmark/tau2/. All internal benchmark/tau2/... path references and the REPO_ROOT depth computations (run_full_eval.sh, tau2_common.py, run_memory_v2_eval.py) are updated for the extra directory level.
- vikingbot/: the VikingBot agent runner (added in the previous commit).

vikingbot/ cleanup:
- make memory-block extraction time-independent: anchor on the stable session header and trailing reply instruction instead of a fixed simulated timestamp (the sim-time patch was removed, so the current time is now system-generated).
- drop the now-removed sim-time / scope-guard notes from the README.
- remove the unused stat_trajectory.py and check_openviking_tool_calls.py helpers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
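
As an illustration of the time-independent extraction described in this commit message, the sketch below slices the text between two stable anchors instead of searching for a timestamp. The anchor strings are hypothetical placeholders (the real header and reply-instruction text are not shown on this page), and it assumes the memory block sits between them.

```python
from typing import Optional

# Hypothetical anchors for illustration only; the harness's real strings may differ.
SESSION_HEADER = "## Current Session"
REPLY_INSTRUCTION = "Reply directly with text"


def extract_memory_block(
    system_prompt: str,
    header: str = SESSION_HEADER,
    trailer: str = REPLY_INSTRUCTION,
) -> Optional[str]:
    """Return the text between the first header and the last trailer, or None."""
    start = system_prompt.find(header)
    if start == -1:
        return None
    start += len(header)
    end = system_prompt.rfind(trailer)
    if end == -1 or end < start:
        return None
    return system_prompt[start:end].strip()
```

Because neither anchor contains a clock value, the result does not change when the system-generated current time changes.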

…lagents, doc updates

- run_full_test.sh: run train once per epoch (experience extraction) and test N times in parallel (--test-repeats, default 8), reporting the averaged test accuracy; keep --commit/--no-commit.
- tau2_environment.py: remove the unused smolagents Tool path (CommunicateWithUser / create_tool_from_json_schema / self.tools); communicate_with_user is handled directly in tool_call. tau2-bench has no smolagents dependency, so it is dropped.
- README: reorder install (tau2-bench first so setup_env can derive TAU2_DATA_ROOT), explain train-once/test-8x methodology and train-only memory extraction, document the required bot/vikingbot core changes (agent_id isolation + agent-experience memory), fix sibling links to ../llm/.
- Remove run_retail_3epochs.sh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…efactor

setup_env.sh now does full environment setup in a single `source`: creates a fresh repo-root .venv, clones tau2-bench (external dep), installs openviking + vikingbot (pip install -e ., runs the Cargo build) + tau2-bench + smolagents, then activates and exports the runtime env vars. Idempotent via a marker file; supports --reinstall. README updated to document the one-step flow and the overridable env vars.

Also move the communicate_with_user tool into a CommunicateWithUser class in tau2_environment.py (owns both schema and execution) and drop the duplicated inline schema from tau2_tool_provider.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ifications

Backport the environment-setup fixes and README clarifications discovered while running the harness end-to-end (the core bot/vikingbot code changes live on the test/tau2-vikingbot-core-changes branch, not here):
- setup_env.sh: install the [bot] extra (prompt_toolkit/gradio/mcp/...), build + bundle ragfs_python via maturin when the editable install skips it under pip build isolation, and install tau2-bench with the [gym] extra (gymnasium)
- README.md: explain the server port (default 1933 vs bot.ov_server.server_url) and show the None-safe forms of the Change-1 diffs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-26T10:35:56Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 75
🧪 No relevant tests
🔒 Security concerns No obvious security issues, but debug prints in commit_trajectory_to_memory.py could leak minor implementation details. Ensure no secrets are logged.
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Move existing tau2 harness into llm/ directory
Relevant files:
- benchmark/tau2/llm/scripts/build_fixed_first_user_fixture.py
- benchmark/tau2/llm/scripts/tau2_common.py
- benchmark/tau2/llm/scripts/run_memory_v2_eval.py
- benchmark/tau2/llm/run_full_eval.sh
- benchmark/tau2/llm/scripts/setup_tau2_repo.sh
- benchmark/tau2/llm/README.md
- benchmark/tau2/llm/config/

Sub-PR theme: Add VikingBot agent runner for tau2-bench
Relevant files:
- benchmark/tau2/vikingbot/
⚡ Recommended focus areas for review

Error handling and data corruption risk
Unregistering 'openviking_memory_commit' without checking if it exists can cause an AttributeError or KeyError, aborting the run. The code also accesses provider.env.env._get_reward(), which is a private API and may break with tau2 updates.

    agent.tools.unregister("openviking_memory_commit")
    for schema in provider.list_openai_tools():
        agent.tools.register(Tau2Tool(schema, provider))

    instructions = []
    if policy:
        instructions.append(policy)
    instructions.append("Use the provided tools to interact with the environment.")
    if args.keep_default_tools:
        instructions.append("Before you attend to customer, you MUST read relevant agent memory that stores experiences distilled from similar tasks and carefully learn them.")
    instructions.append(
        "If you need to communicate with the user, you MUST call tool `communicate_with_user`."
    )
    instructions.append("When the task is finished or terminated, call tool `done` first and output an ending content without using any tool calling for the next round to exit.")
    system_prompt = "\n".join(instructions)
    user_prompt = user_query

    session_id = args.session or f"tau2_{args.data_split}_{args.task_no}"
    session_key = SessionKey(type="cli", channel_id="tau2", chat_id=session_id)
    messages_output_path = _derive_messages_path(output_path)

    final_content, final_reasoning_content, tools_used, token_usage, iteration, memory_content = asyncio.run(
        _run_agent(
            agent,
            system_prompt,
            user_prompt,
            session_key,
            args.sender,
            args.agent_id,
            args.keep_default_tools,
            messages_output_path,
        )
    )

    reward = None
    evaluation_result = None
    if provider.env is not None:
        try:
            reward, evaluation_result = provider.env.env._get_reward()
        except Exception:
            pass

Hardcoded model and private API usage
The user LLM is hardcoded to 'openai/doubao-seed-2-0-pro-260215', which may not be available or intended for all users. Also uses private tau2 API env._get_reward() which is fragile.

    self.env = AgentGymEnv(domain=domain, task_id=task_id, user_llm="openai/doubao-seed-2-0-pro-260215")

Excessively long wait time
Waiting 9000 seconds (2.5 hours) between epochs is unnecessarily long, blocking the pipeline for no clear reason. Should use a more reasonable timeout or polling mechanism.

    WAIT_SECS=9000
    log ">>> Waiting ${WAIT_SECS}s for server async memory commit to finish..."
    sleep "${WAIT_SECS}"

Debug print statements
Debug prints (client.agent_id, get_agent_space_name, client.client._agent_id) are present, which are unnecessary in production/benchmark code and may leak sensitive information.

    print(client.agent_id)
    print(client.get_agent_space_name("default"))
    print(client.client._agent_id)

Error handling for data files
Loading split_tasks.json without error handling for missing files, invalid JSON, or missing split keys can cause silent failures or unhandled exceptions.

    split_path = os.path.join(data_root, "domains", domain, "split_tasks.json")
    with open(split_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    task_ids = data[split]
    task_id = task_ids[task_no]
    return domain, task_id
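
Two of the focus areas above lend themselves to a short sketch: explicit error handling when loading split_tasks.json, and polling with a timeout instead of a fixed sleep. Both functions are hypothetical illustrations, not code from the PR. In particular, `is_done` stands for some readiness check that the caller would have to supply; whether the server exposes such a check is not stated here.

```python
import json
import os
import time
from typing import Any, Callable, Tuple


def load_task_id(data_root: str, domain: str, split: str, task_no: int) -> Tuple[str, Any]:
    """Resolve (domain, task_id) from split_tasks.json, raising ValueError on bad input."""
    split_path = os.path.join(data_root, "domains", domain, "split_tasks.json")
    try:
        with open(split_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError as exc:
        raise ValueError(f"split file not found: {split_path}") from exc
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in {split_path}: {exc}") from exc
    if not isinstance(data, dict) or split not in data:
        raise ValueError(f"split {split!r} not found in {split_path}")
    task_ids = data[split]
    if not 0 <= task_no < len(task_ids):
        raise ValueError(f"task_no {task_no} out of range for split {split!r}")
    return domain, task_ids[task_no]


def wait_until(is_done: Callable[[], bool], timeout_s: float, interval_s: float = 5.0) -> bool:
    """Poll `is_done` until it returns True or `timeout_s` elapses; return the outcome."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Unlike the quoted snippet, this loader rejects negative indices, which is a deliberate choice for the sketch rather than something the review asks for.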

github-actions · 2026-05-26T10:37:58Z

PR Code Suggestions ✨

No code suggestions found for the PR.

ByteDance and others added 6 commits May 26, 2026 18:32

clean README message

c218ca4

github-project-automation Bot moved this to Backlog in OpenViking project May 26, 2026

github-project-automation Bot added this to OpenViking project May 26, 2026

github-actions Bot added the Review effort 4/5 label May 26, 2026

chenjw approved these changes May 26, 2026

chenjw merged commit e0ce670 into volcengine:main May 26, 2026
1 check passed

github-project-automation Bot moved this from Backlog to Done in OpenViking project May 26, 2026

chenjw merged 6 commits into volcengine:main from nicoleqiwt:feat/tau2-vikingbot-benchmark
