bench(tau2): add category-aware memory rerank treatment by huangruiteng · Pull Request #2044 · volcengine/OpenViking

huangruiteng · 2026-05-14T09:41:53Z

Status

Draft follow-up to #2017. This branch is stacked on PR-B and is meant for code review / experiment discussion, not for immediate merge into main.

For the incremental diff against PR-B, review:
huangruiteng/OpenViking@feat/tau2-trajectory-memory...codex/tau2-category-rerank-on-pr-b

Please do not read this as a standalone productized OpenViking reranker yet. The goal is to make the Harness-proven pre-write + category route concrete enough for review: config, runtime trace, coverage gates, and a runnable TAU-2 treatment path.

Memory V2 Reference

This stacked branch has not rerun a fresh Memory V2 baseline. The relevant reference is the Harness reasoning-high S87/S88/S89 comparison:

Route	Retail	Airline	Task-weighted total
NoMem	0.83750	0.72500	0.80000
MemV2	0.75000	0.77500	0.75833
TrajView / custom S84 route	0.85000	0.80000	0.83333

This is why PR-C focuses on category-aware exposure rather than just adding more Memory V2 context: the current MemV2 route was not merely neutral; it was lower than no-memory in the S89-series comparison. The useful signal came from finer-grained procedure / trajectory context plus pre-write and category selection.
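The task-weighted totals in the table above are consistent with a 2:1 retail-to-airline weighting. That weighting is an inference from the reported numbers, not something stated in this PR, so treat the sketch below as a sanity check rather than the benchmark's actual aggregation code. The names `WEIGHTS`, `ROUTES`, and `task_weighted_total` are illustrative.

```python
# Assumed weights: retail 2/3, airline 1/3. These are inferred from the
# reported totals, not read from the TAU-2 benchmark code.
WEIGHTS = {"retail": 2 / 3, "airline": 1 / 3}

# Per-domain scores copied from the reference table.
ROUTES = {
    "NoMem": {"retail": 0.83750, "airline": 0.72500},
    "MemV2": {"retail": 0.75000, "airline": 0.77500},
    "TrajView / custom S84 route": {"retail": 0.85000, "airline": 0.80000},
}


def task_weighted_total(scores, weights=WEIGHTS):
    """Weighted mean of per-domain scores."""
    return sum(weights[domain] * scores[domain] for domain in weights)


for name, scores in ROUTES.items():
    print(f"{name}: {task_weighted_total(scores):.5f}")
```

Under this assumption the recomputed totals match the table to five decimals, which also makes the MemV2 gap explicit: its airline gain does not offset the larger retail loss.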

What This Adds

Adds a TAU-2 category catalog and category-aware rerank helper.
Adds category_rerank.yaml and custom_s84_category.yaml treatments.
Adds custom_s84_category_first_user.yaml as a next diagnostic config for applying category positive-match at both first-user and pre-write nodes.
Adds runtime evidence checks for category coverage, positive-match selection, concrete injected memory, and diagnostic reasons.
Adds scoped prompts / guards needed to reproduce the current S84-style route.
Adds tests for category scoring, mismatch handling, trace summary, and invalid-evidence gating.
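For reviewers who want a mental model of category positive-match, here is a minimal sketch. It is not the code in `category_rerank.py`: the `Memory` type, the `category_rerank` function, the `boost` value, and the category labels are all hypothetical. It assumes that a matching category raises a candidate's score and that a mismatch is left neutral rather than penalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """Hypothetical retrieved memory candidate."""
    text: str
    category: str
    score: float  # base retrieval score, higher is better


def category_rerank(memories, query_category, boost=0.2, top_k=3):
    """Boost candidates whose category matches; leave mismatches untouched."""

    def adjusted(memory):
        bonus = boost if memory.category == query_category else 0.0
        return memory.score + bonus

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(memories, key=adjusted, reverse=True)[:top_k]


candidates = [
    Memory("flight change flow", "flight_change", 0.75),
    Memory("refund policy steps", "refund", 0.70),
    Memory("baggage rules", "baggage", 0.60),
]

for memory in category_rerank(candidates, "refund", top_k=2):
    print(memory.category, memory.score)
```

In this example the refund memory moves ahead of the higher-scored flight-change memory because only the matching category receives the boost. When no candidate matches, the base retrieval order is preserved.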

Latest Eval Read

Current full8 OV-runner result for this branch:

Route	Retail	Airline	Task-weighted total	DB match	Status
PR-C custom S84 scope + category positive-match	0.82188	0.78750	0.81042	0.81875	16/16 cells complete, runtime evidence valid
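The status column reports cell completeness and runtime evidence validity together. A rough sketch of how such a gate could be expressed follows. It assumes a cell is one (domain, repeat) pair, which would give 2 × 8 = 16 cells for a full8 run. The `CellTrace` fields and the `summarize` helper are hypothetical and do not mirror the actual trace schema in this branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellTrace:
    """Hypothetical per-cell runtime trace; field names are illustrative."""
    domain: str
    repeat: int
    completed: bool
    category_covered: bool   # the task was assigned a category
    positive_match: bool     # a same-category memory was selected
    injected_concrete: bool  # the selected memory text reached the prompt


def invalid_reasons(trace):
    """Return diagnostic reasons why a cell's evidence does not count."""
    checks = [
        (trace.completed, "cell did not complete"),
        (trace.category_covered, "no category coverage"),
        (trace.positive_match, "no positive category match"),
        (trace.injected_concrete, "no concrete memory injected"),
    ]
    return [reason for ok, reason in checks if not ok]


def summarize(traces, expected_cells):
    """Report completeness and validity, keeping reasons for failed cells."""
    failures = {}
    for trace in traces:
        reasons = invalid_reasons(trace)
        if reasons:
            failures[(trace.domain, trace.repeat)] = reasons
    return {
        "cells": f"{len(traces)}/{expected_cells}",
        "valid": len(traces) == expected_cells and not failures,
        "failures": failures,
    }


traces = [
    CellTrace(domain, repeat, True, True, True, True)
    for domain in ("retail", "airline")
    for repeat in range(1, 9)
]
report = summarize(traces, expected_cells=16)
print(report["cells"], report["valid"])
```

Keeping the reasons per cell, rather than a single boolean, is what lets a failed gate point at the specific missing evidence.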

Important boundary:

The first 4 repeats were strong: total 0.83750, roughly matching the S89 target anchor.
The full8 result regressed in repeats 5-8, so this is not yet a stable uplift claim.
Positive case: airline task 19 improved strongly under category + pre-write.
Regression cases to inspect next: retail task 49 and airline task 35.

So the current read is: the chain is complete and reviewable, but the policy still needs first-user budget / category gating work before it can be presented as stable outcome improvement.
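The size of the regression in repeats 5-8 can be estimated from the two reported totals. The sketch below assumes every repeat contributes equally to the full8 mean. This PR does not confirm that, so the result is an estimate rather than a measured value.

```python
def implied_second_half_mean(full_mean, first_half_mean):
    """Mean of the second half when both halves carry equal weight."""
    return 2 * full_mean - first_half_mean


full8_total = 0.81042   # reported task-weighted total over 8 repeats
first4_total = 0.83750  # reported total over the first 4 repeats

repeats_5_to_8 = implied_second_half_mean(full8_total, first4_total)
print(f"{repeats_5_to_8:.5f}")  # prints 0.78334
```

Under that assumption, repeats 5-8 averaged roughly 0.783. The Harness NoMem reference of 0.80000 comes from a different runner, so the two figures are not directly comparable.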

Validation

git diff --check
python3 -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py
uvx ruff format --check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py
uvx ruff check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py

Pytest note: running pytest tests/benchmark/test_tau2_category_rerank.py directly in this local checkout is blocked by the bundled AGFS extension import path unless the project is installed with its native extension. CI should run the tests in the repository-native environment.


github-actions · 2026-05-14T09:45:37Z

PR Code Suggestions ✨

No code suggestions found for the PR.

huangruiteng · 2026-05-15T19:22:02Z

Superseded by #2079. I renamed the branch to follow the repository convention and cleaned up the PR scope so that category rerank stays on the OpenViking-native trajectory-view route, without external procedure corpus support.

huangruiteng and others added 30 commits May 13, 2026 17:43

feat(benchmark): add TAU-2 trajectory memory treatment

d1caa77

style(benchmark): format tau2 trajectory scripts

a68d5e7

feat(benchmark): add tau2 trajectory category rerank

4391dd4

Merge remote-tracking branch 'origin/main' into feat/tau2-trajectory-…

0700781

…memory

refine trajectory memory view prompt

fddd7ba

Merge remote-tracking branch 'pr-b/feat-tau2-trajectory-memory' into …

5f8ea0b

…codex/tau2-category-rerank-on-pr-b

test(benchmark): cover tau2 category rerank helper

7cd7acd

feat(benchmark): prepare tau2 memory corpora before eval

0496000

align tau2 category rerank with harness baseline

8e3dc60

bench(tau2): align category rerank with FGMemory route

f7b3815

fix(benchmark): tighten trajectory evidence prompt

08d33a9

Merge remote-tracking branch 'fork/feat/tau2-trajectory-memory' into …

9d2719a

…codex/tau2-category-rerank-on-pr-b

bench(tau2): resolve memory eval artifact paths

1bdc6a9

fix(benchmark): guard tau2 infrastructure failures

9cfe362

Merge commit '9cfe362721cead6f5eaac7e0b6d5a3ada6580682' into codex/ta…

02125f7

…u2-category-rerank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_memory_v2_eval.py

bench(tau2): support category annotation sidecars

dc12e32

Revert "bench(tau2): support category annotation sidecars"

7c88bc4

This reverts commit dc12e32.

docs(tau2): clarify self-generated category signals

91f9edf

fix(benchmark): guard tau2 infrastructure failures

2b767f2

bench(tau2): summarize category trace coverage

a624451

fix(benchmark): resolve tau2 runner paths

63a1004

docs(tau2): document category trace summary

05932dd

fix(memory): add trajectory evidence examples

fb62c46

bench(tau2): align trajectory baseline guard

b613e3e

bench(tau2): report concrete memory trace coverage

cc1f009

bench(tau2): gate diagnostic category evidence

c0aa47a

bench(tau2): tighten category coverage diagnostics

005627f

bench(tau2): expose aggregate-only corpus probes

96af302

bench(tau2): gate aggregate-only corpus probes

1e96d6c

fix(benchmark): run no-memory tau2 eval in process

e30b79b

huangruiteng added 18 commits May 14, 2026 06:12

bench(tau2): align corpus probe width with rerank

15f66e9

bench(tau2): align no-memory runner with PR-B

88505d7

test(tau2): guard S89 category alignment

1977673

bench(tau2): require applied category runtime evidence

b4e1531

bench(tau2): summarize diagnostic evidence reasons

3be6924

bench(tau2): distinguish category coverage from match

7a8f078

Merge remote-tracking branch 'fork/feat/tau2-trajectory-memory' into …

9b5e93a

…codex/tau2-category-rerank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py

bench(tau2): require concrete category injection evidence

c5ac25d

bench(tau2): require injected concrete category match

7b6dc92

bench(tau2): align retrieval budget and fixed first user

662cf0b

bench(tau2): reuse memory corpora across eval runs

f85d60b

Merge branch 'feat/tau2-trajectory-memory' into codex/tau2-category-r…

68e4b4c

…erank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py

bench(tau2): add custom S84 category runner

436e2a4

bench(tau2): add scoped trajectory eval concurrency

d833980

Merge commit 'd833980e' into codex/tau2-category-rerank-on-pr-b

139f7a9

# Conflicts: # benchmark/tau2/README.md # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py

style(benchmark): format tau2 eval runner

74c18db

Merge branch 'feat/tau2-trajectory-memory' into codex/tau2-category-r…

fe85652

…erank-on-pr-b

style(benchmark): format tau2 category rerank

b8884cf

github-project-automation Bot added this to OpenViking project May 14, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 14, 2026

huangruiteng mentioned this pull request May 14, 2026

feat(memory): upgrade trajectory extraction to beat no-memory baseline #2017

Merged

bench(tau2): add first-user category diagnostic config

8648c5d

huangruiteng added 2 commits May 14, 2026 17:46

style(benchmark): satisfy tau2 eval lint

8cf7737

merge: sync category rerank with PR-B lint fix

9c9346e

huangruiteng closed this May 15, 2026

github-project-automation Bot moved this from Backlog to Done in OpenViking project May 15, 2026

huangruiteng wants to merge 51 commits into volcengine:main from huangruiteng:codex/tau2-category-rerank-on-pr-b
