Draft
53 commits
d1caa77
feat(benchmark): add TAU-2 trajectory memory treatment
May 13, 2026
a68d5e7
style(benchmark): format tau2 trajectory scripts
May 13, 2026
4391dd4
feat(benchmark): add tau2 trajectory category rerank
May 13, 2026
0700781
Merge remote-tracking branch 'origin/main' into feat/tau2-trajectory-…
May 13, 2026
fddd7ba
refine trajectory memory view prompt
May 13, 2026
5f8ea0b
Merge remote-tracking branch 'pr-b/feat-tau2-trajectory-memory' into …
May 13, 2026
7cd7acd
test(benchmark): cover tau2 category rerank helper
May 13, 2026
0496000
feat(benchmark): prepare tau2 memory corpora before eval
May 13, 2026
8e3dc60
align tau2 category rerank with harness baseline
May 13, 2026
f7b3815
bench(tau2): align category rerank with FGMemory route
May 13, 2026
08d33a9
fix(benchmark): tighten trajectory evidence prompt
May 13, 2026
9d2719a
Merge remote-tracking branch 'fork/feat/tau2-trajectory-memory' into …
May 13, 2026
1bdc6a9
bench(tau2): resolve memory eval artifact paths
May 13, 2026
9cfe362
fix(benchmark): guard tau2 infrastructure failures
May 13, 2026
02125f7
Merge commit '9cfe362721cead6f5eaac7e0b6d5a3ada6580682' into codex/ta…
May 13, 2026
dc12e32
bench(tau2): support category annotation sidecars
huangruiteng May 13, 2026
7c88bc4
Revert "bench(tau2): support category annotation sidecars"
huangruiteng May 13, 2026
91f9edf
docs(tau2): clarify self-generated category signals
huangruiteng May 13, 2026
2b767f2
fix(benchmark): guard tau2 infrastructure failures
huangruiteng May 13, 2026
a624451
bench(tau2): summarize category trace coverage
huangruiteng May 13, 2026
63a1004
fix(benchmark): resolve tau2 runner paths
huangruiteng May 13, 2026
05932dd
docs(tau2): document category trace summary
huangruiteng May 13, 2026
fb62c46
fix(memory): add trajectory evidence examples
huangruiteng May 13, 2026
b613e3e
bench(tau2): align trajectory baseline guard
huangruiteng May 13, 2026
cc1f009
bench(tau2): report concrete memory trace coverage
huangruiteng May 13, 2026
c0aa47a
bench(tau2): gate diagnostic category evidence
huangruiteng May 13, 2026
005627f
bench(tau2): tighten category coverage diagnostics
huangruiteng May 13, 2026
96af302
bench(tau2): expose aggregate-only corpus probes
huangruiteng May 13, 2026
1e96d6c
bench(tau2): gate aggregate-only corpus probes
huangruiteng May 13, 2026
e30b79b
fix(benchmark): run no-memory tau2 eval in process
huangruiteng May 13, 2026
15f66e9
bench(tau2): align corpus probe width with rerank
huangruiteng May 13, 2026
88505d7
bench(tau2): align no-memory runner with PR-B
huangruiteng May 13, 2026
1977673
test(tau2): guard S89 category alignment
huangruiteng May 13, 2026
b4e1531
bench(tau2): require applied category runtime evidence
huangruiteng May 13, 2026
3be6924
bench(tau2): summarize diagnostic evidence reasons
huangruiteng May 14, 2026
7a8f078
bench(tau2): distinguish category coverage from match
huangruiteng May 14, 2026
9b5e93a
Merge remote-tracking branch 'fork/feat/tau2-trajectory-memory' into …
huangruiteng May 14, 2026
c5ac25d
bench(tau2): require concrete category injection evidence
huangruiteng May 14, 2026
7b6dc92
bench(tau2): require injected concrete category match
huangruiteng May 14, 2026
662cf0b
bench(tau2): align retrieval budget and fixed first user
huangruiteng May 14, 2026
f85d60b
bench(tau2): reuse memory corpora across eval runs
huangruiteng May 14, 2026
68e4b4c
Merge branch 'feat/tau2-trajectory-memory' into codex/tau2-category-r…
huangruiteng May 14, 2026
436e2a4
bench(tau2): add custom S84 category runner
huangruiteng May 14, 2026
d833980
bench(tau2): add scoped trajectory eval concurrency
huangruiteng May 14, 2026
139f7a9
Merge commit 'd833980e' into codex/tau2-category-rerank-on-pr-b
huangruiteng May 14, 2026
74c18db
style(benchmark): format tau2 eval runner
huangruiteng May 14, 2026
fe85652
Merge branch 'feat/tau2-trajectory-memory' into codex/tau2-category-r…
huangruiteng May 14, 2026
b8884cf
style(benchmark): format tau2 category rerank
huangruiteng May 14, 2026
8648c5d
bench(tau2): add first-user category diagnostic config
huangruiteng May 14, 2026
8cf7737
style(benchmark): satisfy tau2 eval lint
huangruiteng May 14, 2026
9c9346e
merge: sync category rerank with PR-B lint fix
huangruiteng May 14, 2026
78be066
bench(tau2): keep category rerank on trajectory memory
huangruiteng May 15, 2026
26a28da
bench(tau2): add first-user category rerank variants
huangruiteng May 15, 2026
119 changes: 109 additions & 10 deletions benchmark/tau2/README.md
@@ -1,30 +1,38 @@
# TAU-2 Benchmark

This directory contains a small OpenViking-style entry point for TAU-2 memory
evaluation. The first version is intentionally narrow:
evaluation. The scope is intentionally narrow:

- fresh OpenViking Memory V2 experience-only baseline;
- Memory V2 pre-write recall treatment.
- trajectory-view retrieval treatment for the refined trajectory prompt;
- experimental category-aware pre-write rerank on top of trajectory-view
memory.

Trajectory / procedure-view prompts, category rerank, and other harness-only
diagnostics are intentionally left out of this first PR.
The category-aware route is opt-in and experimental; it is meant for PR-C review
and smoke/targeted probes before any productization decision.

## Layout

```text
benchmark/tau2/
├── config/
│ ├── baseline.yaml
│ ├── category_rerank.yaml
│ ├── no_memory.yaml
│ ├── official.yaml
│ └── prewrite.yaml
│ ├── prewrite.yaml
│ └── trajectory.yaml
├── scripts/
│ ├── run_eval.py
│ ├── setup_tau2_repo.sh
│ └── tau2_common.py
└── run_full_eval.sh
```

Generated artifacts are written to `benchmark/tau2/result/<run_id>/`.
Generated eval artifacts are written to `benchmark/tau2/result/<run_id>/`.
Memory corpus artifacts are cached outside the run id at
`benchmark/tau2/result/memory_corpora/` by default.

## Quick Start

@@ -51,6 +59,10 @@ Plan the default benchmark without running TAU-2:
python benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/baseline.yaml --plan-only
```

Use `config/no_memory.yaml` for same-runner no-memory baselines; it executes
through the Python wrapper so artifacts and result validation match the memory
cells.

Add `--preflight` or `--strict-preflight` when you want the runner to write a
small environment/config check next to the run plan.

@@ -77,6 +89,30 @@ benchmark/tau2/run_full_eval.sh \
--repeat-count 1
```

Plan a one-cell trajectory-view smoke:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/trajectory.yaml \
--domain retail \
--strategy-id memory_v2_trajectory_view \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1
```

Plan a one-cell trajectory category-rerank smoke:

```bash
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/category_rerank.yaml \
--domain retail \
--strategy-id memory_v2_trajectory_category_prewrite \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1
```

Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):

```bash
@@ -104,20 +140,75 @@ and `OPENAI_API_BASE` for LiteLLM before running upstream TAU-2.
Start the OpenViking service before executing memory cells, and verify it with
`ov status`. For evidence runs, use a clean OpenViking workspace/config and set
`OPENVIKING_URL` explicitly so local custom memory templates do not pollute the
Memory V2 baseline.
Memory V2 baseline. For trajectory-view evidence, start the service from this
branch and inspect generated trajectory files; changing `search_uri` alone does
not prove the new trajectory prompt was used.

## Memory Adapter

`memory_v2_experience_only` and `memory_v2_prewrite` cells run through a small
TAU-2 agent adapter in this directory:
Memory V2 cells run through a small TAU-2 agent adapter in this directory:

- train by writing TAU-2 training conversations into OpenViking sessions;
- evaluate by retrieving OpenViking experience memory at the first user turn;
- evaluate by retrieving OpenViking memory at the first user turn;
- for pre-write recall, retrieve again before write-like tool calls and
regenerate that step with the matched memories;
regenerate that step with the matched memories. The default benchmark
retrieves 6 pre-write candidates and injects 2, which keeps extra candidates
visible in traces without expanding the prompt budget;
- optionally run an explicit scope-prompt treatment that keeps retrieved
memories advisory and asks the agent to preserve the current task scope before
write-like tool calls;
- emit artifact metadata to identify the OpenViking account, agent,
corpus, retrieval mode, and simulator policy used by each cell.

The existing `train_memory_mode: experience_only` value selects the Memory V2
session-commit path. `search_memory_type` selects which generated memory bucket
is retrieved during eval (`experiences` by default, `trajectories` for
`config/trajectory.yaml`). The runner prepares each distinct
`domain + corpus_id` once and reuses it across eval run ids when the cached
`corpus_manifest.json` is present. Different corpora may be prepared in
parallel with `benchmark.corpus_prepare_concurrency`; session commits inside one
corpus remain serial to preserve OpenViking write semantics.

Eval cells run in parallel with `benchmark.strategy_concurrency` by default and
can be overridden with `--strategy-concurrency`. This only parallelizes read-only
TAU-2 eval cells; corpus writes inside one corpus are still serialized by the
prepare step.
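
The corpus reuse rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the runner's actual code: `prepare_corpus`, `prepare_all`, the `<cache_dir>/<domain>/<corpus_id>/` layout, and the manifest fields are hypothetical; only the `corpus_manifest.json` file name and the "parallel across corpora, serial inside one corpus" behavior come from the text.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def prepare_corpus(cache_dir: Path, domain: str, corpus_id: str, sessions: list[str]) -> str:
    """Prepare one corpus, or reuse it when its manifest is already cached."""
    corpus_dir = cache_dir / domain / corpus_id  # hypothetical directory layout
    manifest = corpus_dir / "corpus_manifest.json"
    if manifest.exists():
        return "reused"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    committed = []
    for session in sessions:
        # Stand-in for one session commit; commits inside a corpus stay serial.
        committed.append(session)
    manifest.write_text(
        json.dumps({"domain": domain, "corpus_id": corpus_id, "sessions": committed})
    )
    return "prepared"


def prepare_all(cache_dir: Path, cells: list[dict], sessions_by_domain: dict, concurrency: int = 2) -> dict:
    """Prepare each distinct (domain, corpus_id) once; distinct corpora may run in parallel."""
    keys = sorted({(cell["domain"], cell["corpus_id"]) for cell in cells})
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = pool.map(
            lambda key: prepare_corpus(cache_dir, key[0], key[1], sessions_by_domain[key[0]]),
            keys,
        )
        return dict(zip(keys, statuses))
```

Calling `prepare_all` twice with the same cache directory would report every corpus as `prepared` on the first call and `reused` on the second, which is the behavior the cached manifest is meant to provide.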

`config/category_rerank.yaml` keeps the PR-B trajectory memory route and enables
an adapter-local category-rerank probe: pre-write recall, LLM-generated category
annotation sidecars, and the same scope prompt shape used by the trajectory-view
evidence runs. The category treatment retrieves 6 candidates, keeps positive
category matches, injects at most 2 memories, skips injection when no positive
category match exists, and applies the scope/applicability prompt at the system
prompt injection point. Runtime category rerank is sidecar-only:
the runner looks up query and memory annotations from configured
`annotation_files`; if either side is missing, the cell fails instead of doing
live query-to-category mapping. Retrieval traces include
the query category, candidate memory categories, rerank reasons, selected rows,
skipped rows, scope prompt metadata,
and flat `*_category*_prompt` fields kept compatible with Harness diagnostics.
Each run summary also includes `retrieval_trace_summary`, a compact rollup of
decision nodes, category decisions, query/memory category sources, selected
category coverage, positive query-to-memory category-match coverage,
aggregate-vs-concrete memory candidate coverage, and write tool calls. Use it
as the first check that a run is using this branch's self-generated category
signal before opening the JSONL trace. Category runs whose runtime trace has
only aggregate `.overview.md` / `.abstract.md` candidates, no applied
category-rerank event, no query or memory category coverage, no positive
query-to-memory category match, no actual memory injection, no injected
concrete memory, no injected concrete positive category match, or no selected
positive category match are marked `runtime_evidence.status=diagnostic`;
`scoreboard.json` excludes those diagnostic cells from the main reward/DB
aggregates while preserving their metrics, artifacts, and
`diagnostic_reason_counts` for debugging. Corpus
manifests also include
`corpus_probe.aggregate_match_count` and `corpus_probe.concrete_match_count` so
aggregate-only corpora can be spotted before reading the eval trace; category
runs whose corpus probe is empty, or has matches but no concrete matches, are
also marked diagnostic. The corpus probe uses the category `retrieve_limit`
when category rerank is enabled, so the probe width matches the runtime
pre-write search width.
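
As a rough illustration of the selection rule in the paragraph above (retrieve up to 6 candidates, keep positive category matches, inject at most 2, skip injection when nothing matches, fail when an annotation is missing), here is a hedged sketch. The function name `select_memories`, the dictionary-based sidecars, and the trace keys are assumptions made for this example; the adapter's real data structures and its other mismatch policies are not modeled.

```python
def select_memories(
    query_id: str,
    candidates: list[str],
    query_categories: dict[str, str],
    memory_categories: dict[str, str],
    retrieve_limit: int = 6,
    inject_limit: int = 2,
) -> dict:
    """Keep positive category matches among retrieved candidates, in retrieval order."""
    # Sidecar-only lookup: a missing annotation is an error, never a live mapping.
    if query_id not in query_categories:
        raise KeyError(f"missing query annotation: {query_id}")
    query_category = query_categories[query_id]

    selected, skipped = [], []
    for uri in candidates[:retrieve_limit]:
        if uri not in memory_categories:
            raise KeyError(f"missing memory annotation: {uri}")
        is_match = memory_categories[uri] == query_category
        if is_match and len(selected) < inject_limit:
            selected.append(uri)
        else:
            # Mismatches and matches beyond the inject budget are traced, not injected.
            skipped.append(uri)

    return {
        "query_category": query_category,
        "selected": selected,
        "skipped": skipped,
        "injected": bool(selected),  # False models skipping injection on no match
    }
```

Keeping the skipped rows in the returned trace mirrors the intent stated above: extra candidates stay visible for diagnostics without expanding the prompt budget.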

## User Simulator Policy

The runner default is the official TAU-2 user simulator if
@@ -131,6 +222,14 @@ confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
the upstream PR link is kept in run artifacts, not in the simulator prompt.
Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).

Optional fixed-first-user fixtures keep the first simulated user turn stable
while preserving live simulator behavior after that turn:

```bash
export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail_fixture.json
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline_fixture.json
```
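
The benchmark config reads these variables through shell-style placeholders such as `${TAU2_RETAIL_FIXED_FIRST_USER_FILE:-}`, where an unset or empty variable falls back to the text after `:-`. The sketch below shows one way such a placeholder could be expanded; `expand` and its regular expression are illustrative assumptions, not the loader that `run_eval.py` actually uses.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(value: str, env=None) -> str:
    """Expand ${NAME:-default} placeholders using shell-like semantics."""
    env = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        default = match.group(2) or ""
        current = env.get(name)
        # As in POSIX shells, ":-" applies the default when the variable
        # is unset or set to the empty string.
        return current if current else default

    return _PLACEHOLDER.sub(replace, value)
```

With this reading, leaving a fixture variable unset yields an empty string, which a runner could treat as "no fixed first user turn for this domain".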

Use `config/official.yaml` with a clean TAU-2 checkout when you need an
official-user-simulator parity run. If the checkout was already patched, the
artifact records that boundary instead of labeling the run pure official.
13 changes: 13 additions & 0 deletions benchmark/tau2/config/baseline.yaml
@@ -6,7 +6,9 @@ benchmark:
  train_split_name: train
  eval_split_name: test
  repeat_count: 8
  strategy_concurrency: 8
  task_max_concurrency: 10
  corpus_prepare_concurrency: 2
  max_steps: 200
  seed: 300
  agent: llm_agent
@@ -17,12 +19,20 @@ paths:
  tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
  tau2_cli: ${TAU2_CLI:-tau2}
  output_dir: benchmark/tau2/result
  # Corpus writes are expensive and should be reused across eval run ids when
  # the train split and memory prompt/config did not change.
  corpus_cache_dir: benchmark/tau2/result/memory_corpora

eval:
  # The runner default is official if this field is omitted. The OpenViking
  # memory benchmark config opts into a confirmation-aware TAU-2 user simulator
  # prompt; run_eval.py applies that small prompt patch idempotently when needed.
  user_simulator_policy: confirmation_aware
  # Optional fixed-first-user fixtures keep the first simulated user turn stable
  # while leaving later turns live. Set these env vars to fixture JSON files.
  fixed_first_user_fixtures:
    retail: ${TAU2_RETAIL_FIXED_FIRST_USER_FILE:-}
    airline: ${TAU2_AIRLINE_FIXED_FIRST_USER_FILE:-}

model:
  agent_llm: ${TAU2_AGENT_LLM:-openai/doubao-seed-2-0-pro-260215}
@@ -33,7 +43,10 @@ openviking:
  url: ${OPENVIKING_URL:-http://localhost:1933}
  account: ${OPENVIKING_ACCOUNT:-default}
  agent_id: ${OPENVIKING_AGENT_ID:-tau2-openviking-agent}
  reuse_corpus_across_runs: true
  retrieval_top_k: 4
  prewrite_retrieval_top_k: 6
  prewrite_inject_top_k: 2
  replay_write_policy: read_only

strategies:
136 changes: 136 additions & 0 deletions benchmark/tau2/config/category_rerank.yaml
@@ -0,0 +1,136 @@
extends: trajectory.yaml

benchmark:
  name: tau2_openviking_trajectory_category_rerank
  domains:
    - retail
    - airline

x-trajectory-category-sidecars: &trajectory_category_sidecars
  retail:
    - ${TAU2_RETAIL_TRAJECTORY_FIRST_USER_QUERY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_retail_first_user_query_workflow_c1_20260516_merged/annotations.jsonl}
    - ${TAU2_RETAIL_TRAJECTORY_QUERY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_retail_prewrite_query_annotations_workflow_c1_v2_20260515/annotations.jsonl}
    - ${TAU2_RETAIL_TRAJECTORY_MEMORY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_retail_memory_workflow_c1_seed5_warm12_full_20260515_merged_memory_annotations/annotations.jsonl}
  airline:
    - ${TAU2_AIRLINE_TRAJECTORY_FIRST_USER_QUERY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_airline_first_user_query_workflow_c1_20260516_merged/annotations.jsonl}
    - ${TAU2_AIRLINE_TRAJECTORY_MEMORY_CATEGORY_ANNOTATIONS:-benchmark/tau2/result/category_annotations/tau2_pr_b_trajectory_view_airline_memory_workflow_c1_warm6_full_20260516_merged_memory_annotations/annotations.jsonl}

x-trajectory-scope-prompt: &trajectory_scope_prompt
  enabled: true
  injection_point: system_prompt
  domain_files:
    retail: benchmark/tau2/config/scope_prompts/retail_memory_scope.md
    airline: benchmark/tau2/config/scope_prompts/airline_memory_scope.md

x-prewrite-category-base: &prewrite_category_base
  enabled: true
  annotation_files: *trajectory_category_sidecars
  apply_nodes:
    - before_write_tool_call
  retrieve_limit: 6
  inject_limit: 2
  positive_match_required: true
  no_match_policy: skip_injection
  missing_query_policy: base_rank
  search_score_weight: 0.0

x-first-user-category-base: &first_user_category_base
  enabled: true
  annotation_files: *trajectory_category_sidecars
  apply_nodes:
    - first_user
  retrieve_limit: 6
  inject_limit: 2
  positive_match_required: true
  no_match_policy: skip_injection
  missing_query_policy: fail_fast
  search_score_weight: 0.0

strategies:
  - id: memory_v2_trajectory_prewrite_scope
    label: OpenViking Memory V2 trajectory-view pre-write recall with scope prompt
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    scope_prompt: *trajectory_scope_prompt

  - id: memory_v2_trajectory_category_prewrite_exact
    label: OpenViking Memory V2 trajectory-view scope + pre-write exact-pair category rerank
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    category_rerank:
      <<: *prewrite_category_base
      mismatch_policy: keep_positive_match_drop_mismatch
    scope_prompt: *trajectory_scope_prompt

  - id: memory_v2_trajectory_category_prewrite_priority
    label: OpenViking Memory V2 trajectory-view scope + pre-write category priority fill
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    category_rerank:
      <<: *prewrite_category_base
      mismatch_policy: positive_priority_fill
    scope_prompt: *trajectory_scope_prompt

  - id: memory_v2_trajectory_category_prewrite_strict_pair
    label: OpenViking Memory V2 trajectory-view scope + pre-write strict pair-only category rerank
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    category_rerank:
      <<: *prewrite_category_base
      mismatch_policy: strict_pair_match_only
    scope_prompt: *trajectory_scope_prompt

  - id: memory_v2_trajectory_category_first_user_exact
    label: OpenViking Memory V2 trajectory-view scope + first-user exact-pair category rerank
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    category_rerank:
      <<: *first_user_category_base
      mismatch_policy: keep_positive_match_drop_mismatch
    scope_prompt: *trajectory_scope_prompt

  - id: memory_v2_trajectory_category_first_user_priority
    label: OpenViking Memory V2 trajectory-view scope + first-user category priority fill
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    category_rerank:
      <<: *first_user_category_base
      mismatch_policy: positive_priority_fill
    scope_prompt: *trajectory_scope_prompt

  - id: memory_v2_trajectory_category_first_user_strict_pair
    label: OpenViking Memory V2 trajectory-view scope + first-user strict pair-only category rerank
    memory_backend: openviking
    train_required: true
    corpus_id: memory_v2_trajectory_view
    train_memory_mode: experience_only
    search_memory_type: trajectories
    retrieval_mode: first_user_prewrite
    category_rerank:
      <<: *first_user_category_base
      mismatch_policy: strict_pair_match_only
    scope_prompt: *trajectory_scope_prompt
9 changes: 9 additions & 0 deletions benchmark/tau2/config/no_memory.yaml
@@ -0,0 +1,9 @@
extends: baseline.yaml

benchmark:
  name: tau2_openviking_no_memory

strategies:
  - id: no_memory
    label: TAU-2 no-memory baseline
    memory_backend: none
18 changes: 18 additions & 0 deletions benchmark/tau2/config/scope_prompts/airline_memory_scope.md
@@ -0,0 +1,18 @@
<openviking_memory_scope_guard>
OpenViking memories are advisory. Use them only when their trigger, preconditions,
and applicability boundary match the current airline task.

- Do not broaden the user's requested booking, cancellation, rebooking, flight
update, passenger update, baggage update, insurance, or payment scope because a
retrieved memory describes a nearby workflow.
- Keep the current reservation scope explicit. Only use flights, passengers,
baggage entries, cabin changes, insurance choices, payment IDs, dates, and
amounts that are grounded in user input, recent tool observations, reservation
state, profile/payment state, or an explicit search/lookup result.
- Before a write tool call, verify that the selected write action matches the
user's requested operation. Do not mix cancellation, rebooking, upgrade,
downgrade, baggage, or passenger-update flows unless the current task asks for
that combined operation.
- If a memory and the current task disagree, follow the current task state and the
domain policy.
</openviking_memory_scope_guard>