volcengine · yangxinxin-7 · May 21, 2026 · May 13, 2026
diff --git a/benchmark/tau2/README.md b/benchmark/tau2/README.md
@@ -1,13 +1,13 @@
 # TAU-2 Benchmark
 
 This directory contains a small OpenViking-style entry point for TAU-2 memory
-evaluation. The first version is intentionally narrow:
+evaluation. The scope is intentionally narrow:
 
 - fresh OpenViking Memory V2 experience-only baseline;
 - Memory V2 pre-write recall treatment.
+- trajectory memory retrieval treatment for the refined extraction prompt.
 
-Trajectory / procedure-view prompts, category rerank, and other harness-only
-diagnostics are intentionally left out of this first PR.
+Category rerank and other harness-only diagnostics are intentionally left out.
 
 ## Layout
 
@@ -16,15 +16,18 @@ benchmark/tau2/
 ├── config/
 │   ├── baseline.yaml
 │   ├── official.yaml
-│   └── prewrite.yaml
+│   ├── prewrite.yaml
+│   └── trajectory.yaml
 ├── scripts/
 │   ├── run_eval.py
 │   ├── setup_tau2_repo.sh
 │   └── tau2_common.py
 └── run_full_eval.sh
 ```
 
-Generated artifacts are written to `benchmark/tau2/result/<run_id>/`.
+Generated eval artifacts are written to `benchmark/tau2/result/<run_id>/`.
+Memory corpus artifacts are cached outside the run id at
+`benchmark/tau2/result/memory_corpora/` by default.
 
 ## Quick Start
 
@@ -37,6 +40,21 @@ export TAU2_REPO=/path/to/tau2-bench
 export TAU2_CLI=/path/to/tau2
 ```
 
+The default OpenViking TAU-2 memory evidence protocol is
+`fixed_first_user_full8`: retail + airline, 8 repeats, same seeds, a
+confirmation-aware user simulator, and fixed-first-user fixtures for both
+domains. Later user simulator turns remain live. Set the fixture paths before
+running the default configs:
+
+```bash
+export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail/fixed_first_user_fixture.json
+export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline/fixed_first_user_fixture.json
+```
+
+`--strict-preflight` fails when `eval.require_fixed_first_user=true` and either
+fixture is missing. Use `config/official.yaml` for an explicit non-fixed,
+official-live-user control.
+
 For a local one-command setup, clone and install TAU-2 into ignored benchmark
 directories:
 
@@ -77,6 +95,18 @@ benchmark/tau2/run_full_eval.sh \
   --repeat-count 1
 ```
 
+Plan a one-cell trajectory memory smoke:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/trajectory.yaml \
+  --domain retail \
+  --strategy-id memory_v2_trajectory_view \
+  --num-tasks 1 \
+  --train-num-tasks 1 \
+  --repeat-count 1
+```
+
 Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):
 
 ```bash
@@ -85,6 +115,42 @@ benchmark/tau2/run_full_eval.sh \
   --execute
 ```
 
+## Reproduce the PR-B Evidence
+
+The PR-B headline and content-shape ablation use
+`config/prb_content_matrix_new_prompt.yaml`. It runs the no-memory control plus
+trajectory top4, experience top2, and representative 4000-character budget
+ablation routes across `retail + airline` with 8 repeats.
+
+First run one tiny end-to-end smoke against a clean local OpenViking service:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
+  --domain retail \
+  --strategy-id new_traj_fixed_first_user_prewrite \
+  --num-tasks 1 \
+  --train-num-tasks 1 \
+  --repeat-count 1 \
+  --strict-preflight \
+  --execute
+```
+
+Then run the full PR-B matrix:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
+  --run-id prb_content_matrix_new_prompt_full8 \
+  --strict-preflight \
+  --execute
+```
+
+The main result is written to
+`benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json`.
+Per-cell outputs live under `cell_results/`; corpus identity and generated
+memory checks live under `memory_corpora/`.
+
 For a small E2E smoke, keep both the eval and train slices tiny:
 
 ```bash
@@ -103,21 +169,54 @@ and `OPENAI_API_BASE` for LiteLLM before running upstream TAU-2.
 
 Start the OpenViking service before executing memory cells, and verify it with
 `ov status`. For evidence runs, use a clean OpenViking workspace/config and set
-`OPENVIKING_URL` explicitly so local custom memory templates do not pollute the
-Memory V2 baseline.
+`OPENVIKING_URL` explicitly so local template overrides do not pollute the
+Memory V2 baseline. For trajectory memory evidence, start the service from this
+branch and inspect generated trajectory files; changing `search_uri` alone does
+not prove the new trajectory prompt was used.
 
 ## Memory Adapter
 
-`memory_v2_experience_only` and `memory_v2_prewrite` cells run through a small
-TAU-2 agent adapter in this directory:
+Memory V2 cells run through a small TAU-2 agent adapter in this directory:
 
 - train by writing TAU-2 training conversations into OpenViking sessions;
-- evaluate by retrieving OpenViking experience memory at the first user turn;
+- evaluate by retrieving OpenViking memory at the first user turn;
 - for pre-write recall, retrieve again before write-like tool calls and
-  regenerate that step with the matched memories;
+  regenerate that step with the matched memories. The default benchmark
+  retrieves 6 pre-write candidates and injects 2, which keeps extra candidates
+  visible in traces without expanding the prompt budget;
+- optionally run an explicit scope-prompt treatment that keeps retrieved
+  memories advisory and asks the agent to preserve the current task scope before
+  write-like tool calls. Configs provide per-domain files through
+  `scope_prompt_files`;
 - emit artifact metadata to identify the OpenViking account, agent,
   corpus, retrieval mode, and simulator policy used by each cell.
 
+For exploratory gates, prefer a bounded run with `--cell-timeout-seconds`.
+Timed-out cells are recorded with return code `124` and `timed_out=true`, and
+they are excluded from scoreboard metrics, which keeps smoke runs from silently
+becoming long-running evidence jobs.
+
+The existing `train_memory_mode: experience_only` value selects the Memory V2
+session-commit path. `search_memory_type` selects which generated memory bucket
+is retrieved during eval (`experiences` by default, `trajectories` for
+`config/trajectory.yaml`). The runner prepares each distinct
+`domain + corpus_id` once and reuses it across eval run ids when the cached
+`corpus_manifest.json` is present. Different corpora may be prepared in
+parallel with `benchmark.corpus_prepare_concurrency`; session commits inside one
+corpus remain serial to preserve OpenViking write semantics.
+
+By default, trajectory extraction is transcript-only: the runner replays TAU-2
+messages into an OpenViking session and does not expose held-out reward or
+assertion results to the extractor. The PR-B evidence config can also use a
+structured role/tool transcript, include the domain policy in the training
+session, skip failed train sessions when building positive procedure memory, and
+cap injected memory by total character budget for content-shape ablations.
+
+Eval cells run in parallel. The concurrency defaults to
+`benchmark.strategy_concurrency` and can be overridden with
+`--strategy-concurrency`. This only parallelizes read-only TAU-2 eval cells;
+corpus writes inside one corpus are still serialized by the prepare step.
+
 ## User Simulator Policy
 
 The runner default is the official TAU-2 user simulator if
@@ -131,6 +230,14 @@ confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
 the upstream PR link is kept in run artifacts, not in the simulator prompt.
 Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).
 
+Optional fixed-first-user fixtures keep the first simulated user turn stable
+while preserving live simulator behavior after that turn:
+
+```bash
+export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail_fixture.json
+export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline_fixture.json
+```
+
 Use `config/official.yaml` with a clean TAU-2 checkout when you need an
 official-user-simulator parity run. If the checkout was already patched, the
 artifact records that boundary instead of labeling the run pure official.
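
The timeout behavior described in the Memory Adapter section (return code `124`, `timed_out=true`, exclusion from scoreboard metrics) could be sketched roughly as follows. This is a minimal illustration, not the code in `run_eval.py`: the helper names `run_cell` and `scoreable`, and the record shape, are assumptions made for the example.

```python
import subprocess
import sys

# Return code the README says is recorded for timed-out cells.
TIMEOUT_RETURN_CODE = 124


def run_cell(cmd, timeout_seconds):
    """Run one eval cell command with a wall-clock bound (hypothetical helper)."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return {"returncode": TIMEOUT_RETURN_CODE, "timed_out": True}
    return {"returncode": completed.returncode, "timed_out": False}


def scoreable(records):
    """Keep only the records that may contribute to scoreboard metrics."""
    return [record for record in records if not record["timed_out"]]


if __name__ == "__main__":
    slow = run_cell([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
    fast = run_cell([sys.executable, "-c", "pass"], 30)
    print(slow, fast, scoreable([slow, fast]))
```

Keeping the timeout flag on the record, rather than dropping the cell entirely, means a truncated smoke run stays visible in the artifacts while still being left out of the metrics.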

diff --git a/benchmark/tau2/config/baseline.yaml b/benchmark/tau2/config/baseline.yaml
@@ -6,7 +6,9 @@ benchmark:
   train_split_name: train
   eval_split_name: test
   repeat_count: 8
+  strategy_concurrency: 8
   task_max_concurrency: 10
+  corpus_prepare_concurrency: 2
   max_steps: 200
   seed: 300
   agent: llm_agent
@@ -17,12 +19,25 @@ paths:
   tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
   tau2_cli: ${TAU2_CLI:-tau2}
   output_dir: benchmark/tau2/result
+  # Corpus writes are expensive and should be reused across eval run ids when
+  # the train split and memory prompt/config did not change.
+  corpus_cache_dir: benchmark/tau2/result/memory_corpora
 
 eval:
+  # Default OpenViking TAU-2 memory evidence uses the fixed-first-user full8
+  # protocol: retail + airline, 8 repeats, same seeds, first user turn pinned by
+  # fixtures, later user simulator turns still live.
+  protocol: fixed_first_user_full8
+  require_fixed_first_user: true
   # The runner default is official if this field is omitted. The OpenViking
   # memory benchmark config opts into a confirmation-aware TAU-2 user simulator
   # prompt; run_eval.py applies that small prompt patch idempotently when needed.
   user_simulator_policy: confirmation_aware
+  # Fixed-first-user fixtures keep the first simulated user turn stable while
+  # leaving later turns live. Main PR-B evidence requires these env vars.
+  fixed_first_user_fixtures:
+    retail: ${TAU2_RETAIL_FIXED_FIRST_USER_FILE:-}
+    airline: ${TAU2_AIRLINE_FIXED_FIRST_USER_FILE:-}
 
 model:
   agent_llm: ${TAU2_AGENT_LLM:-openai/doubao-seed-2-0-pro-260215}
@@ -33,7 +48,10 @@ openviking:
   url: ${OPENVIKING_URL:-http://localhost:1933}
   account: ${OPENVIKING_ACCOUNT:-default}
   agent_id: ${OPENVIKING_AGENT_ID:-tau2-openviking-agent}
+  reuse_corpus_across_runs: true
   retrieval_top_k: 4
+  prewrite_retrieval_top_k: 6
+  prewrite_inject_top_k: 2
   replay_write_policy: read_only
 
 strategies:
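
The `prewrite_retrieval_top_k: 6` and `prewrite_inject_top_k: 2` settings above suggest a two-stage selection: retrieve a wider candidate set for the trace, then inject only the top few into the prompt. A minimal sketch of that split, assuming each candidate is a dict with hypothetical `uri` and `score` fields (the real adapter's data shape is not shown in this diff):

```python
def split_prewrite_candidates(candidates, retrieval_top_k=6, inject_top_k=2):
    """Split ranked candidates into prompt-injected and trace-only groups.

    The `score` field is an assumed ranking key; higher is better.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    retrieved = ranked[:retrieval_top_k]
    return {
        "injected": retrieved[:inject_top_k],    # added to the prompt
        "trace_only": retrieved[inject_top_k:],  # logged, not injected
    }


if __name__ == "__main__":
    candidates = [{"uri": f"mem-{i}", "score": i / 10} for i in range(8)]
    result = split_prewrite_candidates(candidates)
    print([c["uri"] for c in result["injected"]])
    print([c["uri"] for c in result["trace_only"]])
```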

diff --git a/benchmark/tau2/config/no_memory.yaml b/benchmark/tau2/config/no_memory.yaml
@@ -0,0 +1,9 @@
+extends: baseline.yaml
+
+benchmark:
+  name: tau2_openviking_no_memory
+
+strategies:
+  - id: no_memory
+    label: TAU-2 no-memory baseline
+    memory_backend: none
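
`no_memory.yaml` relies on `extends: baseline.yaml` and overrides only `benchmark.name` and `strategies`. The runner's actual merge rules are not shown in this diff. One plausible reading, sketched below as an assumption, is a recursive merge in which nested mappings are combined and lists such as `strategies` are replaced wholesale. The function name `merge_config` is hypothetical.

```python
def merge_config(base, override):
    """Recursively merge `override` into `base` (assumed semantics).

    Mappings merge key by key; any other value, including lists,
    replaces the base value. The `extends` key itself is dropped.
    """
    merged = dict(base)
    for key, value in override.items():
        if key == "extends":
            continue
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Illustrative stand-in for baseline.yaml, not its real contents.
    base = {
        "benchmark": {"name": "example_base", "repeat_count": 8},
        "strategies": [{"id": "example_strategy"}],
    }
    override = {
        "extends": "baseline.yaml",
        "benchmark": {"name": "tau2_openviking_no_memory"},
        "strategies": [{"id": "no_memory", "memory_backend": "none"}],
    }
    print(merge_config(base, override))
```

Under this reading, the no-memory config would inherit `repeat_count`, seeds, and the eval protocol from the baseline while running only the `no_memory` strategy.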
diff --git a/benchmark/tau2/config/official.yaml b/benchmark/tau2/config/official.yaml
@@ -4,4 +4,9 @@ benchmark:
   name: tau2_openviking_official_user_simulator
 
 eval:
+  protocol: official_live_user
+  require_fixed_first_user: false
   user_simulator_policy: official
+  fixed_first_user_fixtures:
+    retail:
+    airline:
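
The interaction between `require_fixed_first_user` and `fixed_first_user_fixtures` under `--strict-preflight` can be illustrated with a small check. This is a sketch under stated assumptions, not the runner's implementation: the function name `preflight_errors` is hypothetical, and it assumes that both an unset path and a path pointing at no file count as "missing".

```python
import os


def preflight_errors(eval_cfg, exists=os.path.isfile):
    """Return fixture problems for an `eval` config mapping (hypothetical)."""
    if not eval_cfg.get("require_fixed_first_user", False):
        # official.yaml sets this to false, so empty fixtures are acceptable.
        return []
    fixtures = eval_cfg.get("fixed_first_user_fixtures") or {}
    errors = []
    for domain in ("retail", "airline"):
        path = fixtures.get(domain)
        if not path:
            errors.append(f"{domain}: fixture path is not set")
        elif not exists(path):
            errors.append(f"{domain}: fixture file not found: {path}")
    return errors


if __name__ == "__main__":
    official = {
        "protocol": "official_live_user",
        "require_fixed_first_user": False,
        "fixed_first_user_fixtures": {"retail": None, "airline": None},
    }
    fixed = {
        "protocol": "fixed_first_user_full8",
        "require_fixed_first_user": True,
        "fixed_first_user_fixtures": {"retail": "", "airline": "/path/to/airline.json"},
    }
    print(preflight_errors(official))
    print(preflight_errors(fixed))
```

A strict preflight would then exit non-zero whenever the returned list is non-empty, which matches the README statement that the default configs need both fixture environment variables set.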