volcengine · huangruiteng · May 13, 2026
diff --git a/benchmark/tau2/README.md b/benchmark/tau2/README.md
@@ -1,30 +1,38 @@
 # TAU-2 Benchmark
 
 This directory contains a small OpenViking-style entry point for TAU-2 memory
-evaluation. The first version is intentionally narrow:
+evaluation. The scope is intentionally narrow:
 
 - fresh OpenViking Memory V2 experience-only baseline;
 - Memory V2 pre-write recall treatment.
+- trajectory-view retrieval treatment for the refined trajectory prompt;
+- experimental category-aware pre-write rerank on top of trajectory-view
+  memory.
 
-Trajectory / procedure-view prompts, category rerank, and other harness-only
-diagnostics are intentionally left out of this first PR.
+The category-aware route is opt-in and experimental; it is meant for PR-C review
+and smoke/targeted probes before any productization decision.
 
 ## Layout
 
 ```text
 benchmark/tau2/
 ├── config/
 │   ├── baseline.yaml
+│   ├── category_rerank.yaml
+│   ├── no_memory.yaml
 │   ├── official.yaml
-│   └── prewrite.yaml
+│   ├── prewrite.yaml
+│   └── trajectory.yaml
 ├── scripts/
 │   ├── run_eval.py
 │   ├── setup_tau2_repo.sh
 │   └── tau2_common.py
 └── run_full_eval.sh
 ```
 
-Generated artifacts are written to `benchmark/tau2/result/<run_id>/`.
+Generated eval artifacts are written to `benchmark/tau2/result/<run_id>/`.
+Memory corpus artifacts are cached outside the run id at
+`benchmark/tau2/result/memory_corpora/` by default.
 
 ## Quick Start
 
@@ -51,6 +59,10 @@ Plan the default benchmark without running TAU-2:
 python benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/baseline.yaml --plan-only
 ```
 
+Use `config/no_memory.yaml` for same-runner no-memory baselines; it executes
+through the Python wrapper so artifacts and result validation match the memory
+cells.
+
 Add `--preflight` or `--strict-preflight` when you want the runner to write a
 small environment/config check next to the run plan.
 
@@ -77,6 +89,30 @@ benchmark/tau2/run_full_eval.sh \
   --repeat-count 1
 ```
 
+Plan a one-cell trajectory-view smoke:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/trajectory.yaml \
+  --domain retail \
+  --strategy-id memory_v2_trajectory_view \
+  --num-tasks 1 \
+  --train-num-tasks 1 \
+  --repeat-count 1
+```
+
+Plan a one-cell trajectory category-rerank smoke:
+
+```bash
+benchmark/tau2/run_full_eval.sh \
+  --config benchmark/tau2/config/category_rerank.yaml \
+  --domain retail \
+  --strategy-id memory_v2_trajectory_category_prewrite_exact \
+  --num-tasks 1 \
+  --train-num-tasks 1 \
+  --repeat-count 1
+```
+
 Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):
 
 ```bash
@@ -104,20 +140,75 @@ and `OPENAI_API_BASE` for LiteLLM before running upstream TAU-2.
 Start the OpenViking service before executing memory cells, and verify it with
 `ov status`. For evidence runs, use a clean OpenViking workspace/config and set
 `OPENVIKING_URL` explicitly so local custom memory templates do not pollute the
-Memory V2 baseline.
+Memory V2 baseline. For trajectory-view evidence, start the service from this
+branch and inspect generated trajectory files; changing `search_uri` alone does
+not prove the new trajectory prompt was used.
 
 ## Memory Adapter
 
-`memory_v2_experience_only` and `memory_v2_prewrite` cells run through a small
-TAU-2 agent adapter in this directory:
+Memory V2 cells run through a small TAU-2 agent adapter in this directory:
 
 - train by writing TAU-2 training conversations into OpenViking sessions;
-- evaluate by retrieving OpenViking experience memory at the first user turn;
+- evaluate by retrieving OpenViking memory at the first user turn;
 - for pre-write recall, retrieve again before write-like tool calls and
-  regenerate that step with the matched memories;
+  regenerate that step with the matched memories. The default benchmark
+  retrieves 6 pre-write candidates and injects 2, which keeps extra candidates
+  visible in traces without expanding the prompt budget;
+- optionally run an explicit scope-prompt treatment that keeps retrieved
+  memories advisory and asks the agent to preserve the current task scope before
+  write-like tool calls;
 - emit artifact metadata to identify the OpenViking account, agent,
   corpus, retrieval mode, and simulator policy used by each cell.
 
+The existing `train_memory_mode: experience_only` value selects the Memory V2
+session-commit path. `search_memory_type` selects which generated memory bucket
+is retrieved during eval (`experiences` by default, `trajectories` for
+`config/trajectory.yaml`). The runner prepares each distinct
+`domain + corpus_id` once and reuses it across eval run ids when the cached
+`corpus_manifest.json` is present. Different corpora may be prepared in
+parallel with `benchmark.corpus_prepare_concurrency`; session commits inside one
+corpus remain serial to preserve OpenViking write semantics.
+
+Eval cells run in parallel; `benchmark.strategy_concurrency` sets the default
+width and `--strategy-concurrency` overrides it. Only read-only TAU-2 eval cells
+are parallelized; corpus writes inside one corpus are still serialized by the
+prepare step.
+
+`config/category_rerank.yaml` keeps the PR-B trajectory memory route and enables
+an adapter-local category-rerank probe: pre-write recall, LLM-generated category
+annotation sidecars, and the same scope prompt shape used by the trajectory-view
+evidence runs. The category treatment retrieves 6 candidates, keeps positive
+category matches, injects at most 2 memories, skips injection when none exist,
+and applies the scope/applicability prompt at the system prompt injection point.
+
+Runtime category rerank is sidecar-only: the runner looks up query and memory
+annotations from the configured `annotation_files`; if either side is missing,
+the cell fails instead of doing live query-to-category mapping. Retrieval traces
+include the query category, candidate memory categories, rerank reasons,
+selected rows, skipped rows, scope prompt metadata, and flat
+`*_category*_prompt` fields kept compatible with Harness diagnostics.
+
+Each run summary also includes `retrieval_trace_summary`, a compact rollup of
+decision nodes, category decisions, query/memory category sources, selected
+category coverage, positive query-to-memory category-match coverage,
+aggregate-vs-concrete memory candidate coverage, and write tool calls. Use it
+as the first check that a run is using this branch's self-generated category
+signal before opening the JSONL trace.
+
+Category runs are marked `runtime_evidence.status=diagnostic` when the runtime
+trace has only aggregate `.overview.md` / `.abstract.md` candidates, no applied
+category-rerank event, no query or memory category coverage, no positive
+query-to-memory category match, no actual memory injection, no injected concrete
+memory, no injected concrete positive category match, or no selected positive
+category match. `scoreboard.json` excludes those diagnostic cells from the main
+reward/DB aggregates while preserving their metrics, artifacts, and
+`diagnostic_reason_counts` for debugging. Corpus manifests also include
+`corpus_probe.aggregate_match_count` and `corpus_probe.concrete_match_count` so
+aggregate-only corpora can be spotted before reading the eval trace; category
+runs whose corpus probe is empty, or has matches but no concrete matches, are
+also marked diagnostic. With category rerank enabled, the corpus probe uses the
+category `retrieve_limit`, so it matches the runtime pre-write search width.
+
 ## User Simulator Policy
 
 The runner default is the official TAU-2 user simulator if
@@ -131,6 +222,14 @@ confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
 the upstream PR link is kept in run artifacts, not in the simulator prompt.
 Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).
 
+Optional fixed-first-user fixtures keep the first simulated user turn stable
+while preserving live simulator behavior after that turn:
+
+```bash
+export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail_fixture.json
+export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline_fixture.json
+```
+
 Use `config/official.yaml` with a clean TAU-2 checkout when you need an
 official-user-simulator parity run. If the checkout was already patched, the
 artifact records that boundary instead of labeling the run pure official.
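The category treatment described in the README (retrieve 6, keep positive category matches, inject at most 2, skip injection when nothing matches) can be sketched roughly as follows. This is a minimal illustration, not the adapter's code: `Candidate`, its fields, and `select_for_injection` are hypothetical names, and a "positive match" is assumed here to mean plain equality between the query category and the memory category.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical retrieved memory row; field names are illustrative only."""
    uri: str
    category: Optional[str]
    search_score: float


def select_for_injection(
    query_category: str,
    candidates: List[Candidate],
    retrieve_limit: int = 6,
    inject_limit: int = 2,
) -> Tuple[List[Candidate], List[Candidate]]:
    """Return (selected, skipped) for one rerank decision.

    Assumption: candidates arrive in retrieval order and that order is kept,
    which mirrors `search_score_weight: 0.0` only loosely.
    An empty `selected` list means the injection is skipped.
    """
    window = candidates[:retrieve_limit]
    positives = [c for c in window if c.category == query_category]
    selected = positives[:inject_limit]
    skipped = [c for c in window if c not in selected]
    return selected, skipped
```

With four candidates of which three share the query category, the first two matching rows are selected and the rest land in `skipped`, which is the shape the retrieval trace is described as recording.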

diff --git a/benchmark/tau2/config/baseline.yaml b/benchmark/tau2/config/baseline.yaml
@@ -6,7 +6,9 @@ benchmark:
   train_split_name: train
   eval_split_name: test
   repeat_count: 8
+  strategy_concurrency: 8
   task_max_concurrency: 10
+  corpus_prepare_concurrency: 2
   max_steps: 200
   seed: 300
   agent: llm_agent
@@ -17,12 +19,20 @@ paths:
   tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
   tau2_cli: ${TAU2_CLI:-tau2}
   output_dir: benchmark/tau2/result
+  # Corpus writes are expensive, so reuse them across eval run ids when the
+  # train split and memory prompt/config have not changed.
+  corpus_cache_dir: benchmark/tau2/result/memory_corpora
 
 eval:
   # The runner default is official if this field is omitted. The OpenViking
   # memory benchmark config opts into a confirmation-aware TAU-2 user simulator
   # prompt; run_eval.py applies that small prompt patch idempotently when needed.
   user_simulator_policy: confirmation_aware
+  # Optional fixed-first-user fixtures keep the first simulated user turn stable
+  # while leaving later turns live. Set these env vars to fixture JSON files.
+  fixed_first_user_fixtures:
+    retail: ${TAU2_RETAIL_FIXED_FIRST_USER_FILE:-}
+    airline: ${TAU2_AIRLINE_FIXED_FIRST_USER_FILE:-}
 
 model:
   agent_llm: ${TAU2_AGENT_LLM:-openai/doubao-seed-2-0-pro-260215}
@@ -33,7 +43,10 @@ openviking:
   url: ${OPENVIKING_URL:-http://localhost:1933}
   account: ${OPENVIKING_ACCOUNT:-default}
   agent_id: ${OPENVIKING_AGENT_ID:-tau2-openviking-agent}
+  reuse_corpus_across_runs: true
   retrieval_top_k: 4
+  prewrite_retrieval_top_k: 6
+  prewrite_inject_top_k: 2
   replay_write_policy: read_only
 
 strategies:
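The `${NAME:-default}` placeholders in this config follow the shell convention of falling back to the default when the variable is unset or empty. How the runner expands them is not shown in this diff; the sketch below is one way to do it, and `expand_placeholders` is a hypothetical helper rather than a function from `tau2_common.py`.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_placeholders(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Expand shell-style placeholders in a config string (illustrative only)."""
    source = os.environ if env is None else env

    def _substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        current = source.get(name)
        if current:  # set and non-empty: use the environment value
            return current
        return default if default is not None else ""

    return _PLACEHOLDER.sub(_substitute, value)
```

Under this reading, `${TAU2_RETAIL_FIXED_FIRST_USER_FILE:-}` expands to an empty string when the variable is not exported, which a runner could treat as "no fixture configured".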

diff --git a/benchmark/tau2/config/category_rerank.yaml b/benchmark/tau2/config/category_rerank.yaml
@@ -0,0 +1,136 @@
+extends: trajectory.yaml
+
+benchmark:
+  name: tau2_openviking_trajectory_category_rerank
+  domains:
+    - retail
+    - airline
+
+x-trajectory-category-sidecars: &trajectory_category_sidecars
+  retail:
+    - ${TAU2_RETAIL_TRAJECTORY_FIRST_USER_QUERY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_retail_first_user_query_workflow_c1_20260516_merged/annotations.jsonl}
+    - ${TAU2_RETAIL_TRAJECTORY_QUERY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_retail_prewrite_query_annotations_workflow_c1_v2_20260515/annotations.jsonl}
+    - ${TAU2_RETAIL_TRAJECTORY_MEMORY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_retail_memory_workflow_c1_seed5_warm12_full_20260515_merged_memory_annotations/annotations.jsonl}
+  airline:
+    - ${TAU2_AIRLINE_TRAJECTORY_FIRST_USER_QUERY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_airline_first_user_query_workflow_c1_20260516_merged/annotations.jsonl}
+    - ${TAU2_AIRLINE_TRAJECTORY_MEMORY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_airline_memory_workflow_c1_warm6_full_20260516_merged_memory_annotations/annotations.jsonl}
+
+x-trajectory-scope-prompt: &trajectory_scope_prompt
+  enabled: true
+  injection_point: system_prompt
+  domain_files:
+    retail: benchmark/tau2/config/scope_prompts/retail_memory_scope.md
+    airline: benchmark/tau2/config/scope_prompts/airline_memory_scope.md
+
+x-prewrite-category-base: &prewrite_category_base
+  enabled: true
+  annotation_files: *trajectory_category_sidecars
+  apply_nodes:
+    - before_write_tool_call
+  retrieve_limit: 6
+  inject_limit: 2
+  positive_match_required: true
+  no_match_policy: skip_injection
+  missing_query_policy: base_rank
+  search_score_weight: 0.0
+
+x-first-user-category-base: &first_user_category_base
+  enabled: true
+  annotation_files: *trajectory_category_sidecars
+  apply_nodes:
+    - first_user
+  retrieve_limit: 6
+  inject_limit: 2
+  positive_match_required: true
+  no_match_policy: skip_injection
+  missing_query_policy: fail_fast
+  search_score_weight: 0.0
+
+strategies:
+  - id: memory_v2_trajectory_prewrite_scope
+    label: OpenViking Memory V2 trajectory-view pre-write recall with scope prompt
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    scope_prompt: *trajectory_scope_prompt
+
+  - id: memory_v2_trajectory_category_prewrite_exact
+    label: OpenViking Memory V2 trajectory-view scope + pre-write exact-pair category rerank
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    category_rerank:
+      <<: *prewrite_category_base
+      mismatch_policy: keep_positive_match_drop_mismatch
+    scope_prompt: *trajectory_scope_prompt
+
+  - id: memory_v2_trajectory_category_prewrite_priority
+    label: OpenViking Memory V2 trajectory-view scope + pre-write category priority fill
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    category_rerank:
+      <<: *prewrite_category_base
+      mismatch_policy: positive_priority_fill
+    scope_prompt: *trajectory_scope_prompt
+
+  - id: memory_v2_trajectory_category_prewrite_strict_pair
+    label: OpenViking Memory V2 trajectory-view scope + pre-write strict pair-only category rerank
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    category_rerank:
+      <<: *prewrite_category_base
+      mismatch_policy: strict_pair_match_only
+    scope_prompt: *trajectory_scope_prompt
+
+  - id: memory_v2_trajectory_category_first_user_exact
+    label: OpenViking Memory V2 trajectory-view scope + first-user exact-pair category rerank
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    category_rerank:
+      <<: *first_user_category_base
+      mismatch_policy: keep_positive_match_drop_mismatch
+    scope_prompt: *trajectory_scope_prompt
+
+  - id: memory_v2_trajectory_category_first_user_priority
+    label: OpenViking Memory V2 trajectory-view scope + first-user category priority fill
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    category_rerank:
+      <<: *first_user_category_base
+      mismatch_policy: positive_priority_fill
+    scope_prompt: *trajectory_scope_prompt
+
+  - id: memory_v2_trajectory_category_first_user_strict_pair
+    label: OpenViking Memory V2 trajectory-view scope + first-user strict pair-only category rerank
+    memory_backend: openviking
+    train_required: true
+    corpus_id: memory_v2_trajectory_view
+    train_memory_mode: experience_only
+    search_memory_type: trajectories
+    retrieval_mode: first_user_prewrite
+    category_rerank:
+      <<: *first_user_category_base
+      mismatch_policy: strict_pair_match_only
+    scope_prompt: *trajectory_scope_prompt
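Each strategy above builds its `category_rerank` block with a YAML merge key (`<<: *prewrite_category_base`) plus one explicit `mismatch_policy`. As a rough Python equivalent, the merge behaves like a shallow dict copy in which explicitly written keys win. The dict below copies the values from `x-prewrite-category-base` but omits `annotation_files` for brevity; `with_merge_key` is a hypothetical helper used only for illustration.

```python
# Values copied from x-prewrite-category-base (annotation_files omitted).
PREWRITE_CATEGORY_BASE = {
    "enabled": True,
    "apply_nodes": ["before_write_tool_call"],
    "retrieve_limit": 6,
    "inject_limit": 2,
    "positive_match_required": True,
    "no_match_policy": "skip_injection",
    "missing_query_policy": "base_rank",
    "search_score_weight": 0.0,
}


def with_merge_key(base: dict, overrides: dict) -> dict:
    """Mimic `<<: *base`: copy the base mapping, then let explicit keys win."""
    merged = dict(base)  # shallow copy, like the YAML merge key
    merged.update(overrides)
    return merged


exact = with_merge_key(
    PREWRITE_CATEGORY_BASE,
    {"mismatch_policy": "keep_positive_match_drop_mismatch"},
)
priority = with_merge_key(
    PREWRITE_CATEGORY_BASE,
    {"mismatch_policy": "positive_priority_fill"},
)
```

The two bases differ in more than `apply_nodes`: the pre-write base sets `missing_query_policy: base_rank`, while the first-user base sets `fail_fast`, so the three pre-write strategies and the three first-user strategies are not interchangeable beyond their `mismatch_policy`.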
diff --git a/benchmark/tau2/config/no_memory.yaml b/benchmark/tau2/config/no_memory.yaml
@@ -0,0 +1,9 @@
+extends: baseline.yaml
+
+benchmark:
+  name: tau2_openviking_no_memory
+
+strategies:
+  - id: no_memory
+    label: TAU-2 no-memory baseline
+    memory_backend: none
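How the runner resolves `extends: baseline.yaml` is not shown in this diff. One common convention, assumed here purely for illustration, is a recursive overlay: nested mappings are merged key by key, while scalars and lists in the child replace the parent's values. `merge_config` is a hypothetical helper, and the dicts are trimmed stand-ins for the real files.

```python
from copy import deepcopy


def merge_config(parent: dict, child: dict) -> dict:
    """Overlay `child` on `parent` (assumed semantics, not the runner's code)."""
    merged = deepcopy(parent)
    for key, value in child.items():
        if key == "extends":
            continue  # bookkeeping key, not part of the effective config
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)  # scalars and lists replace
    return merged


baseline = {
    "benchmark": {"repeat_count": 8, "seed": 300},
    "openviking": {"retrieval_top_k": 4},
}
no_memory = {
    "extends": "baseline.yaml",
    "benchmark": {"name": "tau2_openviking_no_memory"},
    "strategies": [
        {"id": "no_memory", "label": "TAU-2 no-memory baseline", "memory_backend": "none"}
    ],
}
effective = merge_config(baseline, no_memory)
```

If lists are replaced rather than concatenated, as assumed here, a child config that declares `strategies` runs only its own strategies and none inherited from the parent.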
diff --git a/benchmark/tau2/config/scope_prompts/airline_memory_scope.md b/benchmark/tau2/config/scope_prompts/airline_memory_scope.md
@@ -0,0 +1,18 @@
+<openviking_memory_scope_guard>
+OpenViking memories are advisory. Use them only when their trigger, preconditions,
+and applicability boundary match the current airline task.
+
+- Do not broaden the user's requested booking, cancellation, rebooking, flight
+  update, passenger update, baggage update, insurance, or payment scope because a
+  retrieved memory describes a nearby workflow.
+- Keep the current reservation scope explicit. Only use flights, passengers,
+  baggage entries, cabin changes, insurance choices, payment IDs, dates, and
+  amounts that are grounded in user input, recent tool observations, reservation
+  state, profile/payment state, or an explicit search/lookup result.
+- Before a write tool call, verify that the selected write action matches the
+  user's requested operation. Do not mix cancellation, rebooking, upgrade,
+  downgrade, baggage, or passenger-update flows unless the current task asks for
+  that combined operation.
+- If a memory and the current task disagree, follow the current task state and the
+  domain policy.
+</openviking_memory_scope_guard>
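The `scope_prompt` block in `category_rerank.yaml` points at this file through `domain_files` and names `system_prompt` as the injection point. A minimal sketch of that wiring is shown below. Both function names are hypothetical, and appending the guard to the end of the system prompt is an assumption; the adapter may place it elsewhere.

```python
from pathlib import Path
from typing import Optional


def load_scope_guard(scope_cfg: dict, domain: str) -> Optional[str]:
    """Read the per-domain guard text, or return None when it does not apply."""
    if not scope_cfg.get("enabled", False):
        return None
    path = scope_cfg.get("domain_files", {}).get(domain)
    if path is None:
        return None
    return Path(path).read_text(encoding="utf-8").strip()


def inject_scope_guard(system_prompt: str, guard: Optional[str]) -> str:
    """Append the guard after the existing system prompt (assumed placement)."""
    if not guard:
        return system_prompt
    return f"{system_prompt.rstrip()}\n\n{guard}\n"
```

Keeping the file read separate from the string assembly makes it easy to log which guard file, if any, was applied to a cell.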