
bench(tau2): add category-aware trajectory memory rerank#2079

Draft
huangruiteng wants to merge 53 commits into volcengine:main from huangruiteng:feat/tau2-category-rerank

Conversation

@huangruiteng
Contributor

@huangruiteng huangruiteng commented May 15, 2026

Description

Follow-up to #2017. This PR adds an opt-in TAU-2 benchmark treatment for category-aware reranking on top of OpenViking trajectory-view memory.

It keeps the benchmark path OpenViking-native:

  • train with train_memory_mode: experience_only;
  • retrieve from memories/trajectories;
  • add sidecar-based query / memory category annotations;
  • before write-like tool calls, retrieve 6 candidates, category-rerank them, and inject at most 2 memories;
  • emit runtime trace summaries and mark cells diagnostic when category/runtime evidence is missing.
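
The retrieve / rerank / inject step listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `category_rerank.py`: the `Candidate` shape, the function names, and the "category match first, then retrieval score" ordering are all assumptions made for the example. Only the numbers (6 candidates retrieved, at most 2 injected) and the diagnostic flag for missing category evidence come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """One retrieved trajectory memory (hypothetical shape, for illustration)."""
    memory_id: str
    score: float              # retrieval score, higher is better
    category: Optional[str]   # sidecar category annotation; None if missing


def category_rerank(pool: List[Candidate], query_category: str,
                    top_k: int) -> List[Candidate]:
    """Assumed rule: category matches first, ties broken by retrieval score."""
    ranked = sorted(
        pool,
        key=lambda c: (c.category == query_category, c.score),
        reverse=True,
    )
    return ranked[:top_k]


def select_memories(candidates: List[Candidate],
                    query_category: Optional[str],
                    retrieve_k: int = 6,
                    inject_k: int = 2) -> Tuple[List[Candidate], bool]:
    """Return (memories to inject, diagnostic flag) for one pre-write call."""
    # Keep the top `retrieve_k` candidates by retrieval score.
    pool = sorted(candidates, key=lambda c: c.score, reverse=True)[:retrieve_k]
    # The cell is diagnostic when any category evidence is missing.
    diagnostic = query_category is None or any(c.category is None for c in pool)
    if query_category is None:
        # No query category: fall back to plain retrieval order.
        return pool[:inject_k], diagnostic
    return category_rerank(pool, query_category, inject_k), diagnostic
```

Under these assumptions, a lower-scored memory whose category matches the query is injected ahead of a higher-scored memory from a different category, and a missing query category degrades to plain retrieval order with the cell flagged as diagnostic.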

The benchmark trains/evaluates its own OpenViking corpus through the config; it does not require a pre-existing external corpus.

Status

Local TAU-2 evidence comes from reasoning-high, fixed-first-user, 8-repeat runs. The headline: trajectory-view + category rerank gives a clear airline lift (0.81250 vs. 0.75000 for the no-memory baseline), while the no-category pre-write route is currently stronger on retail (0.85938 vs. 0.83437).

| Route | retail avg | airline avg | domain avg | Read |
| --- | --- | --- | --- | --- |
| No memory baseline | 0.84688 | 0.75000 | 0.79844 | Current paired OV-native baseline |
| Trajectory-view + pre-write/scope | 0.85938 | 0.74375 | 0.80157 | Best current retail route |
| Trajectory-view + pre-write + category rerank | 0.83437 | 0.81250 | 0.82344 | Clearest category positive signal, mainly airline |
| Best by domain | 0.85938 | 0.81250 | 0.83594 | Retail uses pre-write/scope; airline uses category rerank |
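
As a quick sanity check, the `domain avg` column appears to be the unweighted mean of the two per-domain averages, and the "Best by domain" row takes the best retail and best airline values independently. The sketch below recomputes both from the numbers in the table above. The variable and function names are illustrative, and the results should agree with the table only to within rounding of the reported averages.

```python
# Averages copied from the table above: (retail avg, airline avg).
routes = {
    "No memory baseline": (0.84688, 0.75000),
    "Trajectory-view + pre-write/scope": (0.85938, 0.74375),
    "Trajectory-view + pre-write + category rerank": (0.83437, 0.81250),
}


def domain_avg(retail: float, airline: float) -> float:
    """Unweighted mean of the two per-domain averages."""
    return (retail + airline) / 2


# Recompute the domain avg column for each route.
recomputed = {name: domain_avg(r, a) for name, (r, a) in routes.items()}

# "Best by domain": pick the best route per domain, then average.
best_retail = max(r for r, _ in routes.values())
best_airline = max(a for _, a in routes.values())
best_by_domain = domain_avg(best_retail, best_airline)

for name, value in recomputed.items():
    print(f"{name}: {value:.5f}")
print(f"Best by domain: {best_by_domain:.5f}")
```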

Ablation context:

| Treatment | retail avg | airline avg | aggregate | Read |
| --- | --- | --- | --- | --- |
| Old trajectory prompt -> experiences, first-user only | 0.83125 | 0.66875 | 0.77708 (task-weighted) | Negative control; not positive evidence |
| PR-B trajectory prompt, first-user only | 0.82182 | 0.76875 | 0.79528 (domain avg) | Prompt route alone is close to baseline but not enough |
| PR-B trajectory prompt + pre-write/scope | 0.85938 | 0.74375 | 0.80157 (domain avg) | Retail benefits from pre-write/scope |
| PR-C category rerank | 0.83437 | 0.81250 | 0.82344 (domain avg) | Airline benefits from category selection |

This PR is still a draft because category-sidecar generation and the final product boundary are intentionally kept under review. The benchmark code path itself is now scoped to the OpenViking-native trajectory route.

Type of Change

  • Bug fix
  • New feature / benchmark treatment
  • Breaking change
  • Documentation / benchmark tooling

Verification

  • .venv/bin/python -m pytest -o addopts= tests/benchmark/test_tau2_category_rerank.py
  • python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/category_rerank.yaml --plan-only --run-id prc_cleanup_plan_check2
  • python3 -m py_compile benchmark/tau2/scripts/category_rerank.py benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/build_category_catalog.py benchmark/tau2/scripts/build_category_requests.py benchmark/tau2/scripts/generate_category_annotations.py benchmark/tau2/scripts/run_category_annotation_batches.py

huangruiteng and others added 30 commits May 13, 2026 17:43
…u2-category-rerank-on-pr-b

# Conflicts:
#	benchmark/tau2/scripts/run_memory_v2_eval.py
…codex/tau2-category-rerank-on-pr-b

# Conflicts:
#	benchmark/tau2/scripts/run_eval.py
#	benchmark/tau2/scripts/run_memory_v2_eval.py
…erank-on-pr-b

# Conflicts:
#	benchmark/tau2/scripts/run_eval.py
#	benchmark/tau2/scripts/run_memory_v2_eval.py
# Conflicts:
#	benchmark/tau2/README.md
#	benchmark/tau2/scripts/run_eval.py
#	benchmark/tau2/scripts/run_memory_v2_eval.py
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

**🎫 Ticket compliance analysis**

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

Failed to generate code suggestions for PR


Labels

None yet

Projects

Status: Backlog
