
feat(tau2/vikingbot): benchmark updates #2244

Merged
chenjw merged 6 commits into volcengine:main from nicoleqiwt:feat/tau2-vikingbot-benchmark
May 26, 2026


@nicoleqiwt
Contributor

PR title

feat(benchmark/tau2): add VikingBot agent runner and split harness into llm/ + vikingbot/

PR body

Summary

This PR does two things under benchmark/tau2/:

  1. Adds benchmark/tau2/vikingbot/, a harness that runs the full VikingBot agent
    (bot/vikingbot AgentLoop) end-to-end on tau2-bench tasks, then commits the resulting
    trajectories back into OpenViking memory so the agent can self-improve across epochs
    (cold start → memory-augmented runs). Memory is extracted only from the train split; the
    test split is held out to measure the improvement once that memory is injected (no test-set
    leakage).
  2. Reorganizes benchmark/tau2/ into two sibling subfolders so the two evaluation
    approaches are cleanly separated.
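The train-only memory rule in item 1 can be illustrated with a small sketch. This is not the harness code: `run_task` and `extract_memory` are hypothetical callables, the in-process list stands in for the OpenViking memory store, and the ordering (commit after both splits have run) is an assumption made for the illustration.

```python
from typing import Callable, Dict, List


def run_epochs(
    train_tasks: List[str],
    test_tasks: List[str],
    run_task: Callable[[str, List[str]], float],
    extract_memory: Callable[[str], str],
    epochs: int,
) -> List[Dict[str, float]]:
    """Run `epochs` rounds; memory grows only from train-split trajectories."""
    memory: List[str] = []  # stand-in for the OpenViking memory store (assumption)
    history: List[Dict[str, float]] = []
    for epoch in range(epochs):
        # Both splits only see memory committed in earlier epochs,
        # so epoch 0 is the cold start.
        train_rewards = [run_task(t, memory) for t in train_tasks]
        test_rewards = [run_task(t, memory) for t in test_tasks]
        history.append({
            "epoch": float(epoch),
            "train": sum(train_rewards) / len(train_rewards),
            "test": sum(test_rewards) / len(test_rewards),
        })
        # Commit experience from the train split only; test tasks never
        # contribute to memory, which is what prevents test-set leakage.
        memory.extend(extract_memory(t) for t in train_tasks)
    return history
```

With a toy `run_task` that succeeds only when memory is non-empty, the test score moves from 0.0 in epoch 0 to 1.0 in epoch 1 while the memory contains nothing derived from test tasks.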

Repository layout

benchmark/tau2/
├── llm/        # existing OpenViking Memory V2 retrieval harness (moved here, behavior unchanged)
└── vikingbot/  # new: full VikingBot agent runner
  • llm/ is the previous top-level benchmark/tau2/ content (README.md, config/, scripts/,
    run_full_eval.sh, .gitignore), moved via git mv so history is preserved. Only path
    plumbing changed for the extra directory level: internal benchmark/tau2/... references were
    repointed to benchmark/tau2/llm/..., and the REPO_ROOT depth computations in
    run_full_eval.sh, scripts/tau2_common.py, and scripts/run_memory_v2_eval.py were updated.
    No eval logic changed.
  • vikingbot/ is new.
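As a rough illustration of why the REPO_ROOT depth had to change, the sketch below climbs from a script path to the repository root. The `/repo` prefix and the `repo_root` helper are made up for the example; the real scripts may compute the root differently.

```python
from pathlib import PurePosixPath


def repo_root(script_path: str, levels_up: int) -> PurePosixPath:
    """Return the ancestor `levels_up` directories above the script's folder."""
    # parents[0] is the script's own directory; each further index climbs one level.
    return PurePosixPath(script_path).parents[levels_up]


# Before the move: scripts/ -> tau2/ -> benchmark/ -> repo root (3 levels up).
old = repo_root("/repo/benchmark/tau2/scripts/tau2_common.py", 3)

# After the move: scripts/ -> llm/ -> tau2/ -> benchmark/ -> repo root (4 levels up).
new = repo_root("/repo/benchmark/tau2/llm/scripts/tau2_common.py", 4)

# Keeping the old depth after the move would stop one directory short.
stale = repo_root("/repo/benchmark/tau2/llm/scripts/tau2_common.py", 3)
```

Here `old` and `new` both resolve to `/repo`, while `stale` resolves to `/repo/benchmark`, which is the kind of off-by-one the path plumbing change avoids.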

The two harnesses are complementary, not duplicates. Both are multi-turn and exercise OpenViking
memory extraction + retrieval; they differ in which agent drives the tasks:

  • benchmark/tau2/llm/ (moved) — tau2-bench's native ReAct agent, wired to OpenViking memory,
    to measure the effect of that memory on task performance.
  • benchmark/tau2/vikingbot/ (new) — an end-to-end, self-improving agent eval: runs the full
    VikingBot agent loop and commits trajectories back into memory so the agent improves across epochs.

What's included (vikingbot/)

  • scripts/vikingbot_tau2_runner.py — run one tau2 task through the agent loop.
  • scripts/run_tau2_domain.sh, scripts/run_eval_reward.sh — run a domain split with bounded
    concurrency and score average reward.
  • scripts/commit_trajectory_to_memory.py — commit train trajectories into OpenViking memory.
  • tau2_env/ — tau2 environment + tool-provider integration.
  • run_full_test.sh — one epoch: 1 train run + 8 test runs in parallel (--test-repeats, default
    8), with test accuracy averaged over the repeats. run_airline_2epochs.sh is a multi-epoch
    example.
  • setup_env.sh, README.md, .gitignore.
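The "bounded concurrency" and "averaged over the repeats" ideas in the list above could look roughly like this in Python. The shell scripts in the PR are the actual implementation; `run_task` is a hypothetical callable returning a reward, and this sketch runs the repeats one after another rather than in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Sequence


def average_reward(
    task_ids: Sequence[str],
    run_task: Callable[[str], float],
    max_workers: int = 4,
) -> float:
    """Run every task with at most `max_workers` in flight; return the mean reward."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rewards = list(pool.map(run_task, task_ids))
    return mean(rewards)


def averaged_test_accuracy(
    task_ids: Sequence[str],
    run_task: Callable[[str], float],
    repeats: int = 8,
    max_workers: int = 4,
) -> float:
    """Repeat the whole test split `repeats` times and average the per-repeat scores."""
    per_repeat = [average_reward(task_ids, run_task, max_workers) for _ in range(repeats)]
    return mean(per_repeat)
```

Averaging over several repeats reduces the run-to-run variance that comes from a stochastic agent and user simulator, which is presumably why the test split is run more than once.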

Notes

  • Runner-level adaptations — tau2 tool registry and epoch-based memory
    commit live entirely in the runner. The runner uses existing
    bot/vikingbot APIs on current main (AgentLoop(eval=...), _run_agent_loop,
    context.build_messages(..., memory_users=...), cli.commands._init_bot_data/_make_provider,
    utils.helpers.get_source_workspace_path).
  • Two bot/vikingbot core changes are required to reproduce: per-domain workspace isolation
    via agent_id, and reading only v2 agent memory (not user memory) at system-prompt build
    time. These are documented with diffs in vikingbot/README.md but are not included as code
    in this PR; apply them to your bot/vikingbot checkout.
  • The llm/ move changes only path references, not behavior. A repo-wide check found no code
    outside benchmark/tau2/ that references the moved paths.
  • tau2-bench is an external dependency, cloned and installed by the user (not vendored, no extra
    packages beyond it). The user-simulator uses an
    OpenAI-compatible endpoint (OPENAI_API_KEY / OPENAI_API_BASE).
  • Generated artifacts (result*/, trajectory*/), the external tau2-bench/ checkout, logs and
    reports are git-ignored.

Testing

  • bash -n passes on all shell scripts (both llm/ and vikingbot/).
  • python3 -m py_compile passes on all Python modules.
  • llm/ reorg verified: every benchmark/tau2/... reference repointed to benchmark/tau2/llm/...,
    REPO_ROOT depth computations fixed, and no external references to the moved paths remain.
  • Required bot/vikingbot API surface verified against current main by source inspection.
  • Full runtime requires the external dependency tau2-bench, the two documented bot/vikingbot
    core changes applied, and a running OpenViking server started with --with-bot.

ByteDance and others added 6 commits May 26, 2026 18:32

Adds benchmark/tau2/vikingbot/, an end-to-end harness that runs the full
VikingBot AgentLoop on tau2-bench tasks and commits trajectories back into
OpenViking memory for epoch-based self-improvement. This complements the
existing memory-retrieval harness in benchmark/tau2/ (which is retrieval-only).

Contents:
- scripts/vikingbot_tau2_runner.py: run one tau2 task through the agent loop
  (tau2 tool registry swap, simulated-time patch, advisory memory scope guard).
- scripts/run_tau2_domain.sh / run_eval_reward.sh: run a domain split with
  bounded concurrency and score average reward.
- scripts/commit_trajectory_to_memory.py: commit train trajectories to memory.
- scripts/stat_trajectory.py, check_openviking_tool_calls.py: analysis helpers.
- tau2_env/: tau2 environment + tool-provider integration.
- run_full_test.sh and run_{airline,retail}_*epochs.sh: full / multi-epoch runs.
- setup_env.sh, README.md, .gitignore.

tau2-bench is referenced as an external dependency (cloned + installed by the
user); no OpenViking core changes are required. The runner is API-compatible
with bot/vikingbot on current main.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirror the two evaluation approaches as sibling subfolders under benchmark/tau2/:

- llm/: the existing OpenViking Memory V2 retrieval harness, moved from
  benchmark/tau2/. All internal benchmark/tau2/... path references and the
  REPO_ROOT depth computations (run_full_eval.sh, tau2_common.py,
  run_memory_v2_eval.py) are updated for the extra directory level.
- vikingbot/: the VikingBot agent runner (added in the previous commit).

vikingbot/ cleanup:
- make memory-block extraction time-independent: anchor on the stable session
  header and trailing reply instruction instead of a fixed simulated timestamp
  (the sim-time patch was removed, so the current time is now system-generated).
- drop the now-removed sim-time / scope-guard notes from the README.
- remove the unused stat_trajectory.py and check_openviking_tool_calls.py helpers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…lagents, doc updates

- run_full_test.sh: run train once per epoch (experience extraction) and test
  N times in parallel (--test-repeats, default 8), reporting the averaged test
  accuracy; keep --commit/--no-commit.
- tau2_environment.py: remove the unused smolagents Tool path (CommunicateWithUser /
  create_tool_from_json_schema / self.tools); communicate_with_user is handled
  directly in tool_call. tau2-bench has no smolagents dependency, so it is dropped.
- README: reorder install (tau2-bench first so setup_env can derive TAU2_DATA_ROOT),
  explain train-once/test-8x methodology and train-only memory extraction, document
  the required bot/vikingbot core changes (agent_id isolation + agent-experience
  memory), fix sibling links to ../llm/.
- Remove run_retail_3epochs.sh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…efactor

setup_env.sh now does full environment setup in a single `source`: creates a
fresh repo-root .venv, clones tau2-bench (external dep), installs openviking +
vikingbot (pip install -e ., runs the Cargo build) + tau2-bench + smolagents,
then activates and exports the runtime env vars. Idempotent via a marker file;
supports --reinstall. README updated to document the one-step flow and the
overridable env vars.

Also move the communicate_with_user tool into a CommunicateWithUser class in
tau2_environment.py (owns both schema and execution) and drop the duplicated
inline schema from tau2_tool_provider.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ifications

Backport the environment-setup fixes and README clarifications discovered while
running the harness end-to-end (the core bot/vikingbot code changes live on the
test/tau2-vikingbot-core-changes branch, not here):

- setup_env.sh: install the [bot] extra (prompt_toolkit/gradio/mcp/...), build +
  bundle ragfs_python via maturin when the editable install skips it under pip
  build isolation, and install tau2-bench with the [gym] extra (gymnasium)
- README.md: explain the server port (default 1933 vs bot.ov_server.server_url)
  and show the None-safe forms of the Change-1 diffs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 75
🧪 No relevant tests
🔒 Security concerns

No obvious security issues, but debug prints in commit_trajectory_to_memory.py could leak minor implementation details. Ensure no secrets are logged.

✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Move existing tau2 harness into llm/ directory

Relevant files:

  • benchmark/tau2/llm/scripts/build_fixed_first_user_fixture.py
  • benchmark/tau2/llm/scripts/tau2_common.py
  • benchmark/tau2/llm/scripts/run_memory_v2_eval.py
  • benchmark/tau2/llm/run_full_eval.sh
  • benchmark/tau2/llm/scripts/setup_tau2_repo.sh
  • benchmark/tau2/llm/README.md
  • benchmark/tau2/llm/config/

Sub-PR theme: Add VikingBot agent runner for tau2-bench

Relevant files:

  • benchmark/tau2/vikingbot/

⚡ Recommended focus areas for review

Error handling and data corruption risk

Unregistering 'openviking_memory_commit' without checking if it exists can cause an AttributeError or KeyError, aborting the run. The code also accesses provider.env.env._get_reward(), which is a private API and may break with tau2 updates.

agent.tools.unregister("openviking_memory_commit")
for schema in provider.list_openai_tools():
    agent.tools.register(Tau2Tool(schema, provider))

instructions = []
if policy:
    instructions.append(policy)
instructions.append("Use the provided tools to interact with the environment.")
if args.keep_default_tools:
    instructions.append("Before you attend to customer, you MUST read relevant agent memory that stores experiences distilled from similar tasks and carefully learn them.")
instructions.append(
    "If you need to communicate with the user, you MUST call tool `communicate_with_user`."
)
instructions.append("When the task is finished or terminated, call tool `done` first and output an ending content without using any tool calling for the next round to exit.")

system_prompt = "\n".join(instructions)
user_prompt = user_query

session_id = args.session or f"tau2_{args.data_split}_{args.task_no}"
session_key = SessionKey(type="cli", channel_id="tau2", chat_id=session_id)

messages_output_path = _derive_messages_path(output_path)

final_content, final_reasoning_content, tools_used, token_usage, iteration, memory_content = asyncio.run(
    _run_agent(
        agent,
        system_prompt,
        user_prompt,
        session_key,
        args.sender,
        args.agent_id,
        args.keep_default_tools,
        messages_output_path,
    )
)

reward = None
evaluation_result = None
if provider.env is not None:
    try:
        reward, evaluation_result = provider.env.env._get_reward()
    except Exception:
        pass
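One possible way to address both points is sketched below. It assumes, without having checked, that the registry's `unregister` raises `KeyError` or `AttributeError` for a missing tool; `DictRegistry` is a hypothetical stand-in used only to make the example runnable, and the helper names are not part of the PR.

```python
import logging
from typing import Any, Optional, Tuple

logger = logging.getLogger("tau2_runner_sketch")


def unregister_if_present(registry: Any, name: str) -> bool:
    """Unregister `name`; return False instead of raising when it is absent."""
    try:
        registry.unregister(name)
    except (KeyError, AttributeError) as exc:
        logger.warning("tool %r was not unregistered: %s", name, exc)
        return False
    return True


def read_reward(env: Any) -> Tuple[Optional[float], Any]:
    """Call the private `_get_reward` hook, logging failures rather than hiding them."""
    getter = getattr(env, "_get_reward", None)
    if getter is None:
        logger.error("environment has no _get_reward; the tau2 API may have changed")
        return None, None
    try:
        return getter()
    except Exception:
        # Keep the run alive, but leave a traceback in the log.
        logger.exception("reward computation failed")
        return None, None


class DictRegistry:
    """Hypothetical stand-in for the agent's tool registry (illustration only)."""

    def __init__(self) -> None:
        self._tools = {}

    def register(self, name: str, tool: Any) -> None:
        self._tools[name] = tool

    def unregister(self, name: str) -> None:
        del self._tools[name]  # raises KeyError when the tool is missing
```

Compared with the bare `except Exception: pass`, a missing reward then shows up in the logs instead of silently becoming `None` in the results.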

Hardcoded model and private API usage

The user LLM is hardcoded to 'openai/doubao-seed-2-0-pro-260215', which may not be available or intended for all users. Also uses private tau2 API env._get_reward() which is fragile.

self.env = AgentGymEnv(domain=domain, task_id=task_id, user_llm="openai/doubao-seed-2-0-pro-260215")
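A minimal sketch of making the user-simulator model configurable follows. The environment variable name `TAU2_USER_LLM` and the helper `resolve_user_llm` are hypothetical; the PR itself defines neither.

```python
import os
from typing import Mapping, Optional

# The value currently hardcoded in the PR, kept as the fallback.
DEFAULT_USER_LLM = "openai/doubao-seed-2-0-pro-260215"


def resolve_user_llm(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the user-simulator model from the environment, else use the default."""
    env = os.environ if env is None else env
    # TAU2_USER_LLM is an assumed variable name, not one defined by tau2-bench.
    value = env.get("TAU2_USER_LLM", "").strip()
    return value or DEFAULT_USER_LLM
```

The result could then be passed as `user_llm=resolve_user_llm()` when constructing the environment, which keeps the current behavior for anyone who does not set the variable.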

Excessively long wait time

Waiting 9000 seconds (2.5 hours) between epochs is unnecessarily long, blocking the pipeline for no clear reason. Should use a more reasonable timeout or polling mechanism.

WAIT_SECS=9000
log ">>> Waiting ${WAIT_SECS}s for server async memory commit to finish..."
sleep "${WAIT_SECS}"
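The polling alternative mentioned above might be sketched as follows. How to detect that the server's async memory commit has finished is not described in the PR, so `is_done` is a hypothetical predicate the caller would have to supply; the clock and sleep functions are injectable only to make the helper easy to test.

```python
import time
from typing import Callable


def wait_until(
    is_done: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_done` every `interval_s` seconds; give up after `timeout_s`."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

With this shape, the 9000-second value becomes an upper bound rather than a fixed cost, and a `False` return makes a stalled commit visible instead of letting the next epoch start on incomplete memory.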

Debug print statements

Debug prints (client.agent_id, get_agent_space_name, client.client._agent_id) are present, which are unnecessary in production/benchmark code and may leak sensitive information.

print(client.agent_id)
print(client.get_agent_space_name("default"))
print(client.client._agent_id)

Error handling for data files

Loading split_tasks.json without error handling for missing files, invalid JSON, or missing split keys can cause silent failures or unhandled exceptions.

split_path = os.path.join(data_root, "domains", domain, "split_tasks.json")
with open(split_path, "r", encoding="utf-8") as f:
    data = json.load(f)
task_ids = data[split]
task_id = task_ids[task_no]
return domain, task_id
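A hedged sketch of the same lookup with explicit failure modes is shown below. The function name `resolve_task` and the error messages are illustrative; the file layout is taken from the snippet above. Unlike the original, this version also rejects negative indices, which is a design choice of the sketch rather than something the review asked for.

```python
import json
import os
from typing import Tuple


def resolve_task(data_root: str, domain: str, split: str, task_no: int) -> Tuple[str, str]:
    """Look up a task id in split_tasks.json, failing with a specific error."""
    split_path = os.path.join(data_root, "domains", domain, "split_tasks.json")
    try:
        with open(split_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"split file not found: {split_path}") from exc
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in {split_path}: {exc}") from exc

    if not isinstance(data, dict) or split not in data:
        raise KeyError(f"split {split!r} not found in {split_path}")

    task_ids = data[split]
    # Reject out-of-range (and negative) indices instead of raising a bare IndexError.
    if not 0 <= task_no < len(task_ids):
        raise IndexError(
            f"task_no {task_no} out of range for split {split!r} ({len(task_ids)} tasks)"
        )
    return domain, task_ids[task_no]
```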

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@chenjw chenjw merged commit e0ce670 into volcengine:main May 26, 2026
1 check passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 26, 2026
