bench(tau2): add category-aware trajectory memory rerank by huangruiteng · Pull Request #2079 · volcengine/OpenViking

huangruiteng · 2026-05-15T19:21:05Z

Description

Follow-up to #2017. This PR adds an opt-in TAU-2 benchmark treatment for category-aware reranking on top of OpenViking trajectory-view memory.

It keeps the benchmark path OpenViking-native:

train with train_memory_mode: experience_only;
retrieve from memories/trajectories;
add sidecar-based query / memory category annotations;
before write-like tool calls, retrieve 6 candidates, category-rerank them, and inject at most 2 memories;
emit runtime trace summaries and mark cells diagnostic when category/runtime evidence is missing.
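
The pre-write steps above can be sketched as follows. This is a minimal illustration, not the code in benchmark/tau2/scripts/category_rerank.py: the `Candidate` type, the `category_rerank` function, the trace field names, and the example category labels are all hypothetical. The rule used to set `diagnostic` is an assumption about what "category/runtime evidence is missing" means. Only the budget of 6 retrieved candidates and at most 2 injected memories comes from the description.

```python
from dataclasses import dataclass
from typing import Optional

RETRIEVE_K = 6  # candidates fetched before a write-like tool call (from the PR text)
INJECT_MAX = 2  # upper bound on injected memories (from the PR text)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical retrieved trajectory memory with a sidecar category."""
    memory_id: str
    score: float              # retrieval score, higher is better (assumed)
    category: Optional[str]   # sidecar annotation; None when it is missing


def category_rerank(query_category, candidates, inject_max=INJECT_MAX):
    """Put category matches first, then order by retrieval score."""
    def is_match(c):
        return query_category is not None and c.category == query_category

    ranked = sorted(candidates, key=lambda c: (0 if is_match(c) else 1, -c.score))
    selected = ranked[:inject_max]
    matched = [c for c in selected if is_match(c)]
    trace = {
        "retrieved": len(candidates),
        "injected": len(selected),
        "category_matches_injected": len(matched),
        # Assumed rule: diagnostic when the query has no category
        # or no category-matching memory was actually injected.
        "diagnostic": query_category is None or not matched,
    }
    return selected, trace


candidates = [
    Candidate("m1", 0.91, "refund"),
    Candidate("m2", 0.88, "cancel_flight"),
    Candidate("m3", 0.80, "cancel_flight"),
    Candidate("m4", 0.75, None),
    Candidate("m5", 0.70, "baggage"),
    Candidate("m6", 0.65, "refund"),
]
assert len(candidates) == RETRIEVE_K

selected, trace = category_rerank("cancel_flight", candidates)
print([c.memory_id for c in selected])  # ['m2', 'm3']
print(trace["diagnostic"])              # False
```

In this sketch a missing query category falls back to plain score order, and the cell is flagged as diagnostic rather than counted as category evidence.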

The benchmark trains/evaluates its own OpenViking corpus through the config; it does not require a pre-existing external corpus.

Status

Local TAU-2 evidence comes from reasoning-high, fixed-first-user, 8-repeat runs. The headline: trajectory-view + category rerank gives a clear airline lift (0.81250 vs. 0.75000 for the no-memory baseline), while the no-category pre-write route is currently stronger on retail (0.85938 vs. 0.83437 with category rerank).

Route	retail avg	airline avg	domain avg	Read
No memory baseline	0.84688	0.75000	0.79844	Current paired OV-native baseline
Trajectory-view + pre-write/scope	0.85938	0.74375	0.80157	Best current retail route
Trajectory-view + pre-write + category rerank	0.83437	0.81250	0.82344	Clearest category positive signal, mainly airline
Best by domain	0.85938	0.81250	0.83594	Retail uses pre-write/scope; airline uses category rerank
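
As a quick check on how the table's columns relate, the sketch below recomputes the domain avg column as the unweighted mean of the retail and airline averages, and the "Best by domain" row as the best route picked independently per domain. The dictionary keys are shorthand invented here for the three routes; the numbers are copied from the table.

```python
# Per-route averages copied from the table above: (retail avg, airline avg).
routes = {
    "no_memory_baseline": (0.84688, 0.75000),
    "prewrite_scope": (0.85938, 0.74375),
    "prewrite_category_rerank": (0.83437, 0.81250),
}


def domain_avg(retail: float, airline: float) -> float:
    """Unweighted mean of the two domain averages."""
    return (retail + airline) / 2


domain_avgs = {name: domain_avg(*scores) for name, scores in routes.items()}

# "Best by domain" takes the best route independently in each domain.
best_retail = max(retail for retail, _ in routes.values())
best_airline = max(airline for _, airline in routes.values())
best_by_domain = domain_avg(best_retail, best_airline)

# Deltas of the category-rerank route against the no-memory baseline.
base_retail, base_airline = routes["no_memory_baseline"]
cat_retail, cat_airline = routes["prewrite_category_rerank"]
delta_retail = cat_retail - base_retail      # negative: retail regresses slightly
delta_airline = cat_airline - base_airline   # positive: the airline lift

for name, value in domain_avgs.items():
    print(f"{name}: {value:.5f}")
print(f"best_by_domain: {best_by_domain:.5f}")
print(f"category rerank vs baseline: retail {delta_retail:+.5f}, airline {delta_airline:+.5f}")
```

The recomputed values agree with the table to within rounding in the fifth decimal place. The category-rerank route gains 0.0625 on airline and loses about 0.0125 on retail relative to the baseline.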

Ablation context:

Treatment	retail avg	airline avg	aggregate	Read
Old trajectory prompt -> experiences, first-user only	0.83125	0.66875	0.77708 task-weighted	Negative control; not positive evidence
PR-B trajectory prompt, first-user only	0.82182	0.76875	0.79528 domain avg	Prompt route alone is close to baseline but not enough
PR-B trajectory prompt + pre-write/scope	0.85938	0.74375	0.80157 domain avg	Retail benefits from pre-write/scope
PR-C category rerank	0.83437	0.81250	0.82344 domain avg	Airline benefits from category selection
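
The aggregate column mixes two definitions: the first row is task-weighted, the others are domain averages. The sketch below shows the difference. The 2:1 retail:airline weighting is an assumption chosen because it reproduces the reported 0.77708; the PR does not state the task counts behind that figure.

```python
def domain_avg(scores: dict) -> float:
    """Unweighted mean over domains."""
    return sum(scores.values()) / len(scores)


def task_weighted(scores: dict, weights: dict) -> float:
    """Mean over domains, weighted by (relative) task counts."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total


# "Old trajectory prompt -> experiences, first-user only" row from the table.
old_prompt = {"retail": 0.83125, "airline": 0.66875}

# Assumption: relative weights only, consistent with the reported aggregate.
assumed_weights = {"retail": 2, "airline": 1}

weighted = task_weighted(old_prompt, assumed_weights)
unweighted = domain_avg(old_prompt)

print(f"task-weighted (assumed 2:1): {weighted:.5f}")
print(f"domain avg:                  {unweighted:.5f}")
```

On a domain-avg basis the negative-control row comes out at 0.75000, which is the figure to compare against the other rows' domain averages.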

This PR is still a draft because category-sidecar generation and the final product boundary are intentionally kept under review. The benchmark code path itself is now scoped to the OpenViking-native trajectory route.

Type of Change

Bug fix
New feature / benchmark treatment
Breaking change
Documentation / benchmark tooling

Verification

.venv/bin/python -m pytest -o addopts= tests/benchmark/test_tau2_category_rerank.py
python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/category_rerank.yaml --plan-only --run-id prc_cleanup_plan_check2
python3 -m py_compile benchmark/tau2/scripts/category_rerank.py benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/build_category_catalog.py benchmark/tau2/scripts/build_category_requests.py benchmark/tau2/scripts/generate_category_annotations.py benchmark/tau2/scripts/run_category_annotation_batches.py

github-actions · 2026-05-15T19:22:17Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-05-15T19:23:53Z

Failed to generate code suggestions for PR

huangruiteng and others added 30 commits May 13, 2026 17:43

feat(benchmark): add TAU-2 trajectory memory treatment

d1caa77

style(benchmark): format tau2 trajectory scripts

a68d5e7

feat(benchmark): add tau2 trajectory category rerank

4391dd4

Merge remote-tracking branch 'origin/main' into feat/tau2-trajectory-memory

0700781

refine trajectory memory view prompt

fddd7ba

Merge remote-tracking branch 'pr-b/feat-tau2-trajectory-memory' into codex/tau2-category-rerank-on-pr-b

5f8ea0b

test(benchmark): cover tau2 category rerank helper

7cd7acd

feat(benchmark): prepare tau2 memory corpora before eval

0496000

align tau2 category rerank with harness baseline

8e3dc60

bench(tau2): align category rerank with FGMemory route

f7b3815

fix(benchmark): tighten trajectory evidence prompt

08d33a9

Merge remote-tracking branch 'fork/feat/tau2-trajectory-memory' into codex/tau2-category-rerank-on-pr-b

9d2719a

bench(tau2): resolve memory eval artifact paths

1bdc6a9

fix(benchmark): guard tau2 infrastructure failures

9cfe362

Merge commit '9cfe362721cead6f5eaac7e0b6d5a3ada6580682' into codex/tau2-category-rerank-on-pr-b

02125f7

# Conflicts: # benchmark/tau2/scripts/run_memory_v2_eval.py

bench(tau2): support category annotation sidecars

dc12e32

Revert "bench(tau2): support category annotation sidecars"

7c88bc4

This reverts commit dc12e32.

docs(tau2): clarify self-generated category signals

91f9edf

fix(benchmark): guard tau2 infrastructure failures

2b767f2

bench(tau2): summarize category trace coverage

a624451

fix(benchmark): resolve tau2 runner paths

63a1004

docs(tau2): document category trace summary

05932dd

fix(memory): add trajectory evidence examples

fb62c46

bench(tau2): align trajectory baseline guard

b613e3e

bench(tau2): report concrete memory trace coverage

cc1f009

bench(tau2): gate diagnostic category evidence

c0aa47a

bench(tau2): tighten category coverage diagnostics

005627f

bench(tau2): expose aggregate-only corpus probes

96af302

bench(tau2): gate aggregate-only corpus probes

1e96d6c

fix(benchmark): run no-memory tau2 eval in process

e30b79b

huangruiteng added 22 commits May 14, 2026 06:12

bench(tau2): align corpus probe width with rerank

15f66e9

bench(tau2): align no-memory runner with PR-B

88505d7

test(tau2): guard S89 category alignment

1977673

bench(tau2): require applied category runtime evidence

b4e1531

bench(tau2): summarize diagnostic evidence reasons

3be6924

bench(tau2): distinguish category coverage from match

7a8f078

Merge remote-tracking branch 'fork/feat/tau2-trajectory-memory' into codex/tau2-category-rerank-on-pr-b

9b5e93a

# Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py

bench(tau2): require concrete category injection evidence

c5ac25d

bench(tau2): require injected concrete category match

7b6dc92

bench(tau2): align retrieval budget and fixed first user

662cf0b

bench(tau2): reuse memory corpora across eval runs

f85d60b

Merge branch 'feat/tau2-trajectory-memory' into codex/tau2-category-rerank-on-pr-b

68e4b4c

# Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py

bench(tau2): add custom S84 category runner

436e2a4

bench(tau2): add scoped trajectory eval concurrency

d833980

Merge commit 'd833980e' into codex/tau2-category-rerank-on-pr-b

139f7a9

# Conflicts: # benchmark/tau2/README.md # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py

style(benchmark): format tau2 eval runner

74c18db

Merge branch 'feat/tau2-trajectory-memory' into codex/tau2-category-rerank-on-pr-b

fe85652

style(benchmark): format tau2 category rerank

b8884cf

bench(tau2): add first-user category diagnostic config

8648c5d

style(benchmark): satisfy tau2 eval lint

8cf7737

merge: sync category rerank with PR-B lint fix

9c9346e

bench(tau2): keep category rerank on trajectory memory

78be066

github-project-automation Bot added this to OpenViking project May 15, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 15, 2026

huangruiteng mentioned this pull request May 15, 2026

bench(tau2): add category-aware memory rerank treatment #2044

Closed

bench(tau2): add first-user category rerank variants

26a28da

huangruiteng wants to merge 53 commits into volcengine:main from huangruiteng:feat/tau2-category-rerank
