Merged
22 commits
d1caa77
feat(benchmark): add TAU-2 trajectory memory treatment
May 13, 2026
a68d5e7
style(benchmark): format tau2 trajectory scripts
May 13, 2026
0700781
Merge remote-tracking branch 'origin/main' into feat/tau2-trajectory-…
May 13, 2026
fddd7ba
refine trajectory memory view prompt
May 13, 2026
0496000
feat(benchmark): prepare tau2 memory corpora before eval
May 13, 2026
08d33a9
fix(benchmark): tighten trajectory evidence prompt
May 13, 2026
2b767f2
fix(benchmark): guard tau2 infrastructure failures
huangruiteng May 13, 2026
63a1004
fix(benchmark): resolve tau2 runner paths
huangruiteng May 13, 2026
fb62c46
fix(memory): add trajectory evidence examples
huangruiteng May 13, 2026
e30b79b
fix(benchmark): run no-memory tau2 eval in process
huangruiteng May 13, 2026
662cf0b
bench(tau2): align retrieval budget and fixed first user
huangruiteng May 14, 2026
f85d60b
bench(tau2): reuse memory corpora across eval runs
huangruiteng May 14, 2026
d833980
bench(tau2): add scoped trajectory eval concurrency
huangruiteng May 14, 2026
74c18db
style(benchmark): format tau2 eval runner
huangruiteng May 14, 2026
8cf7737
style(benchmark): satisfy tau2 eval lint
huangruiteng May 14, 2026
d8297c6
bench(tau2): harden trajectory memory eval variants
huangruiteng May 19, 2026
5a4f679
fix(tau2): rebuild search URI for reused corpora
huangruiteng May 19, 2026
f135e7f
docs(tau2): add PR-B reproduction commands
huangruiteng May 20, 2026
70031a4
chore(tau2): keep PR-B benchmark scope focused
huangruiteng May 20, 2026
b41cf4d
chore(tau2): keep trajectory prompt generic
huangruiteng May 20, 2026
1cdfead
chore(tau2): format benchmark scripts
huangruiteng May 20, 2026
d65a2f9
bench(tau2): restore operation-family trajectory protocol
huangruiteng May 20, 2026
129 changes: 118 additions & 11 deletions benchmark/tau2/README.md
@@ -1,13 +1,13 @@
# TAU-2 Benchmark

This directory contains a small OpenViking-style entry point for TAU-2 memory
evaluation. The first version is intentionally narrow:
evaluation. The scope is intentionally narrow:

- fresh OpenViking Memory V2 experience-only baseline;
- Memory V2 pre-write recall treatment;
- trajectory memory retrieval treatment for the refined extraction prompt.

Trajectory / procedure-view prompts, category rerank, and other harness-only
diagnostics are intentionally left out of this first PR.
Category rerank and other harness-only diagnostics are intentionally left out.

## Layout

@@ -16,15 +16,18 @@ benchmark/tau2/
├── config/
│ ├── baseline.yaml
│ ├── official.yaml
│ └── prewrite.yaml
│ ├── prewrite.yaml
│ └── trajectory.yaml
├── scripts/
│ ├── run_eval.py
│ ├── setup_tau2_repo.sh
│ └── tau2_common.py
└── run_full_eval.sh
```

Generated artifacts are written to `benchmark/tau2/result/<run_id>/`.
Generated eval artifacts are written to `benchmark/tau2/result/<run_id>/`.
Memory corpus artifacts are cached outside the run id at
`benchmark/tau2/result/memory_corpora/` by default.

## Quick Start

@@ -37,6 +40,21 @@ export TAU2_REPO=/path/to/tau2-bench
export TAU2_CLI=/path/to/tau2
```

The default OpenViking TAU-2 memory evidence protocol is
`fixed_first_user_full8`: retail + airline, 8 repeats, same seeds, a
confirmation-aware user simulator, and fixed first-user fixtures for both
domains. Later user simulator turns remain live. Set the fixture paths before
running the default configs:

```bash
export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail/fixed_first_user_fixture.json
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline/fixed_first_user_fixture.json
```

`--strict-preflight` fails when `eval.require_fixed_first_user=true` and either
fixture is missing. Use `config/official.yaml` for an explicit non-fixed,
official-live-user control.
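
The preflight rule above can be sketched as follows. This is a minimal illustration, not the runner's actual code: `missing_fixture_domains` and `strict_preflight` are hypothetical names, and treating an unset variable, an empty value, or a nonexistent file as "missing" is an assumption. Only the two environment variable names come from this README.

```python
import os
from typing import Callable, List, Mapping

# Environment variable names come from this README; everything else is hypothetical.
FIXTURE_ENV_VARS = {
    "retail": "TAU2_RETAIL_FIXED_FIRST_USER_FILE",
    "airline": "TAU2_AIRLINE_FIXED_FIRST_USER_FILE",
}


def missing_fixture_domains(
    env: Mapping[str, str],
    require_fixed_first_user: bool,
    exists: Callable[[str], bool] = os.path.isfile,
) -> List[str]:
    """Return domains whose fixture variable is unset, empty, or not a file."""
    if not require_fixed_first_user:
        return []  # official-live-user control: fixtures are not required
    missing = []
    for domain, var in FIXTURE_ENV_VARS.items():
        path = env.get(var, "")
        if not path or not exists(path):
            missing.append(domain)
    return missing


def strict_preflight(env: Mapping[str, str], require_fixed_first_user: bool) -> None:
    """Abort before any cell runs if a required fixture is missing."""
    missing = missing_fixture_domains(env, require_fixed_first_user)
    if missing:
        raise SystemExit(
            "strict preflight failed; missing fixture for: " + ", ".join(missing)
        )
```

Calling `strict_preflight(os.environ, True)` before a run would fail fast, rather than letting a default-config run proceed with a live first user turn.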

For a local one-command setup, clone and install TAU-2 into ignored benchmark
directories:

@@ -77,6 +95,18 @@ benchmark/tau2/run_full_eval.sh \
--repeat-count 1
```

Plan a one-cell trajectory memory smoke:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/trajectory.yaml \
--domain retail \
--strategy-id memory_v2_trajectory_view \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1
```

Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):

```bash
@@ -85,6 +115,42 @@ benchmark/tau2/run_full_eval.sh \
--execute
```

## Reproduce the PR-B Evidence

The PR-B headline and content-shape ablation use
`config/prb_content_matrix_new_prompt.yaml`. It runs the no-memory control plus
trajectory top4, experience top2, and representative 4000-character budget
ablation routes across `retail + airline` with 8 repeats.

First run one tiny end-to-end smoke against a clean local OpenViking service:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
--domain retail \
--strategy-id new_traj_fixed_first_user_prewrite \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1 \
--strict-preflight \
--execute
```

Then run the full PR-B matrix:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
--run-id prb_content_matrix_new_prompt_full8 \
--strict-preflight \
--execute
```

The main result is written to
`benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json`.
Per-cell outputs live under `cell_results/`; corpus identity and generated
memory checks live under `memory_corpora/`.

For a small E2E smoke, keep both the eval and train slices tiny:

```bash
@@ -103,21 +169,54 @@ and `OPENAI_API_BASE` for LiteLLM before running upstream TAU-2.

Start the OpenViking service before executing memory cells, and verify it with
`ov status`. For evidence runs, use a clean OpenViking workspace/config and set
`OPENVIKING_URL` explicitly so local custom memory templates do not pollute the
Memory V2 baseline.
`OPENVIKING_URL` explicitly so local template overrides do not pollute the
Memory V2 baseline. For trajectory memory evidence, start the service from this
branch and inspect generated trajectory files; changing `search_uri` alone does
not prove the new trajectory prompt was used.

## Memory Adapter

`memory_v2_experience_only` and `memory_v2_prewrite` cells run through a small
TAU-2 agent adapter in this directory:
Memory V2 cells run through a small TAU-2 agent adapter in this directory:

- train by writing TAU-2 training conversations into OpenViking sessions;
- evaluate by retrieving OpenViking experience memory at the first user turn;
- evaluate by retrieving OpenViking memory at the first user turn;
- for pre-write recall, retrieve again before write-like tool calls and
regenerate that step with the matched memories;
regenerate that step with the matched memories. The default benchmark
retrieves 6 pre-write candidates and injects 2, which keeps extra candidates
visible in traces without expanding the prompt budget;
- optionally run an explicit scope-prompt treatment that keeps retrieved
memories advisory and asks the agent to preserve the current task scope before
write-like tool calls. Configs provide per-domain files through
`scope_prompt_files`;
- emit artifact metadata to identify the OpenViking account, agent,
corpus, retrieval mode, and simulator policy used by each cell.
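
The pre-write "retrieve 6, inject 2" behavior in the list above can be sketched as a simple ranking split. This is an illustrative sketch only: the function name, the candidate dictionaries, and the `score` field are assumptions, not the adapter's real data model. The defaults mirror `prewrite_retrieval_top_k: 6` and `prewrite_inject_top_k: 2`.

```python
from typing import Dict, List, Tuple

Candidate = Dict[str, object]  # hypothetical shape: {"uri": str, "score": float}


def split_prewrite_candidates(
    candidates: List[Candidate],
    retrieve_top_k: int = 6,
    inject_top_k: int = 2,
) -> Tuple[List[Candidate], List[Candidate]]:
    """Return (injected, trace_only) from the best-scoring candidates.

    The top `retrieve_top_k` candidates are kept for the trace, but only the
    first `inject_top_k` of them are placed in the prompt.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    ranked = ranked[:retrieve_top_k]
    return ranked[:inject_top_k], ranked[inject_top_k:]
```

Keeping the trace-only candidates separate makes it possible to inspect near misses without changing what the agent actually sees.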

For exploratory gates, prefer a bounded run with `--cell-timeout-seconds`.
Timed-out cells are recorded with return code `124`, `timed_out=true`, and are
excluded from scoreboard metrics, which keeps smoke runs from silently becoming
long-running evidence jobs.
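
A bounded cell run along these lines could look like the sketch below. It assumes each cell is launched as a subprocess; `run_cell`, `scoreboard_cells`, and the record dictionary are hypothetical and only mirror the `124` / `timed_out=true` convention described above.

```python
import subprocess
from typing import Dict, List, Sequence

TIMEOUT_RETURN_CODE = 124  # return code this README records for timed-out cells


def run_cell(cmd: Sequence[str], timeout_seconds: float) -> Dict[str, object]:
    """Run one cell command with a wall-clock bound (record shape is hypothetical)."""
    try:
        proc = subprocess.run(list(cmd), timeout=timeout_seconds, capture_output=True)
        return {"returncode": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        return {"returncode": TIMEOUT_RETURN_CODE, "timed_out": True}


def scoreboard_cells(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only cells that finished, so timeouts never count toward metrics."""
    return [record for record in records if not record["timed_out"]]
```

Filtering on the explicit `timed_out` flag rather than on the return code avoids confusing a timeout with a cell that happened to exit with `124` on its own.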

The existing `train_memory_mode: experience_only` value selects the Memory V2
session-commit path. `search_memory_type` selects which generated memory bucket
is retrieved during eval (`experiences` by default, `trajectories` for
`config/trajectory.yaml`). The runner prepares each distinct
`domain + corpus_id` once and reuses it across eval run ids when the cached
`corpus_manifest.json` is present. Different corpora may be prepared in
parallel with `benchmark.corpus_prepare_concurrency`; session commits inside one
corpus remain serial to preserve OpenViking write semantics.
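
The reuse rule can be sketched as a manifest check. This is an assumption-laden illustration: the `<cache_dir>/<domain>/<corpus_id>/` directory layout, the `ensure_corpus` name, and the manifest contents are hypothetical; only the `corpus_manifest.json` filename and the `domain + corpus_id` key come from the text above.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Tuple

Manifest = Dict[str, object]


def ensure_corpus(
    cache_dir: Path,
    domain: str,
    corpus_id: str,
    prepare: Callable[[Path], Manifest],
) -> Tuple[Manifest, bool]:
    """Return (manifest, reused). `prepare` runs only when no manifest is cached."""
    corpus_dir = cache_dir / domain / corpus_id  # hypothetical layout
    manifest_path = corpus_dir / "corpus_manifest.json"
    if manifest_path.is_file():
        # A cached manifest means the expensive training writes are skipped.
        return json.loads(manifest_path.read_text(encoding="utf-8")), True
    corpus_dir.mkdir(parents=True, exist_ok=True)
    manifest = prepare(corpus_dir)  # serial session commits would happen here
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest, False
```

Writing the manifest only after `prepare` returns means an interrupted preparation leaves no manifest behind, so the next run prepares the corpus again instead of reusing a partial one.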

By default, trajectory extraction is transcript-only: the runner replays TAU-2
messages into an OpenViking session and does not expose held-out reward or
assertion results to the extractor. The PR-B evidence config can also use a
structured role/tool transcript, include the domain policy in the training
session, skip failed train sessions when building positive procedure memory, and
cap injected memory by total character budget for content-shape ablations.
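
A total character budget cap could be implemented roughly as below. This is a sketch under stated assumptions: `cap_by_char_budget` is a hypothetical name, memories are assumed to arrive in rank order, and stopping at the first memory that does not fit (rather than truncating or skipping it) is one possible policy, not necessarily the one the config uses. The 4000 default echoes the 4000-character budget ablation mentioned earlier.

```python
from typing import List


def cap_by_char_budget(memories: List[str], budget: int = 4000) -> List[str]:
    """Keep ranked memories, in order, until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for text in memories:
        if used + len(text) > budget:
            break  # assumption: stop rather than truncate or skip ahead
        kept.append(text)
        used += len(text)
    return kept
```

Stopping at the first overflow preserves rank order, at the cost of sometimes leaving part of the budget unused.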

Eval cells run in parallel with `benchmark.strategy_concurrency` by default and
can be overridden with `--strategy-concurrency`. This only parallelizes read-only
TAU-2 eval cells; corpus writes inside one corpus are still serialized by the
prepare step.

## User Simulator Policy

The runner default is the official TAU-2 user simulator if
@@ -131,6 +230,14 @@ confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
the upstream PR link is kept in run artifacts, not in the simulator prompt.
Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).

Optional fixed-first-user fixtures keep the first simulated user turn stable
while preserving live simulator behavior after that turn:

```bash
export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail_fixture.json
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline_fixture.json
```

Use `config/official.yaml` with a clean TAU-2 checkout when you need an
official-user-simulator parity run. If the checkout was already patched, the
artifact records that boundary instead of labeling the run pure official.
18 changes: 18 additions & 0 deletions benchmark/tau2/config/baseline.yaml
@@ -6,7 +6,9 @@ benchmark:
  train_split_name: train
  eval_split_name: test
  repeat_count: 8
  strategy_concurrency: 8
  task_max_concurrency: 10
  corpus_prepare_concurrency: 2
  max_steps: 200
  seed: 300
  agent: llm_agent
@@ -17,12 +19,25 @@ paths:
  tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
  tau2_cli: ${TAU2_CLI:-tau2}
  output_dir: benchmark/tau2/result
  # Corpus writes are expensive and should be reused across eval run ids when
  # the train split and memory prompt/config did not change.
  corpus_cache_dir: benchmark/tau2/result/memory_corpora

eval:
  # Default OpenViking TAU-2 memory evidence uses the fixed-first-user full8
  # protocol: retail + airline, 8 repeats, same seeds, first user turn pinned by
  # fixtures, later user simulator turns still live.
  protocol: fixed_first_user_full8
  require_fixed_first_user: true
  # The runner default is official if this field is omitted. The OpenViking
  # memory benchmark config opts into a confirmation-aware TAU-2 user simulator
  # prompt; run_eval.py applies that small prompt patch idempotently when needed.
  user_simulator_policy: confirmation_aware
  # Fixed-first-user fixtures keep the first simulated user turn stable while
  # leaving later turns live. Main PR-B evidence requires these env vars.
  fixed_first_user_fixtures:
    retail: ${TAU2_RETAIL_FIXED_FIRST_USER_FILE:-}
    airline: ${TAU2_AIRLINE_FIXED_FIRST_USER_FILE:-}

model:
  agent_llm: ${TAU2_AGENT_LLM:-openai/doubao-seed-2-0-pro-260215}
@@ -33,7 +48,10 @@ openviking:
  url: ${OPENVIKING_URL:-http://localhost:1933}
  account: ${OPENVIKING_ACCOUNT:-default}
  agent_id: ${OPENVIKING_AGENT_ID:-tau2-openviking-agent}
  reuse_corpus_across_runs: true
  retrieval_top_k: 4
  prewrite_retrieval_top_k: 6
  prewrite_inject_top_k: 2
  replay_write_policy: read_only

strategies:
9 changes: 9 additions & 0 deletions benchmark/tau2/config/no_memory.yaml
@@ -0,0 +1,9 @@
extends: baseline.yaml

benchmark:
  name: tau2_openviking_no_memory

strategies:
  - id: no_memory
    label: TAU-2 no-memory baseline
    memory_backend: none
5 changes: 5 additions & 0 deletions benchmark/tau2/config/official.yaml
@@ -4,4 +4,9 @@ benchmark:
  name: tau2_openviking_official_user_simulator

eval:
  protocol: official_live_user
  require_fixed_first_user: false
  user_simulator_policy: official
  fixed_first_user_fixtures:
    retail:
    airline: