bench(tau2): add category-aware memory rerank treatment #2044

Closed
huangruiteng wants to merge 51 commits into volcengine:main from huangruiteng:codex/tau2-category-rerank-on-pr-b

@huangruiteng huangruiteng commented May 14, 2026

Status

Draft follow-up to #2017. This branch is stacked on PR-B and is meant for code review / experiment discussion, not for immediate merge into main.

For the incremental diff against PR-B, review:
huangruiteng/OpenViking@feat/tau2-trajectory-memory...codex/tau2-category-rerank-on-pr-b

Please do not read this as a standalone productized OpenViking reranker yet. The goal is to make the Harness-proven pre-write + category route concrete enough for review: config, runtime trace, coverage gates, and a runnable TAU-2 treatment path.

Memory V2 Reference

This stacked branch has not rerun a fresh Memory V2 baseline. The relevant reference is the Harness reasoning-high S87/S88/S89 comparison:

| Route | Retail | Airline | Task-weighted total |
| --- | --- | --- | --- |
| NoMem | 0.83750 | 0.72500 | 0.80000 |
| MemV2 | 0.75000 | 0.77500 | 0.75833 |
| TrajView / custom S84 route | 0.85000 | 0.80000 | 0.83333 |

This is why PR-C focuses on category-aware exposure rather than just adding more Memory V2 context: the current MemV2 route was not merely neutral; it was lower than no-memory in the S89-series comparison. The useful signal came from finer-grained procedure / trajectory context plus pre-write and category selection.
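
For readers checking the arithmetic: the "Task-weighted total" column is consistent with weighting retail twice as heavily as airline. The task counts are not stated in this PR, so the 2:1 weights below are an inference from the reported numbers, and `task_weighted_total` is a hypothetical helper, not a function from the benchmark scripts.

```python
# Assumption: retail carries twice the weight of airline (for example,
# twice as many tasks). This ratio is inferred, not stated in the PR.
RETAIL_WEIGHT = 2
AIRLINE_WEIGHT = 1


def task_weighted_total(retail: float, airline: float) -> float:
    """Combine per-domain scores into one weighted average."""
    total_weight = RETAIL_WEIGHT + AIRLINE_WEIGHT
    return (RETAIL_WEIGHT * retail + AIRLINE_WEIGHT * airline) / total_weight


# Per-domain scores copied from the reference comparison.
routes = {
    "NoMem": (0.83750, 0.72500),
    "MemV2": (0.75000, 0.77500),
    "TrajView / custom S84 route": (0.85000, 0.80000),
}

totals = {
    name: round(task_weighted_total(retail, airline), 5)
    for name, (retail, airline) in routes.items()
}

for name, total in totals.items():
    print(f"{name}: {total:.5f}")
```

Under that assumption the three computed totals match the reported column to five decimal places.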

What This Adds

  • Adds a TAU-2 category catalog and category-aware rerank helper.
  • Adds category_rerank.yaml and custom_s84_category.yaml treatments.
  • Adds custom_s84_category_first_user.yaml as a next diagnostic config for applying category positive-match at both first-user and pre-write nodes.
  • Adds runtime evidence checks for category coverage, positive-match selection, concrete injected memory, and diagnostic reasons.
  • Adds scoped prompts / guards needed to reproduce the current S84-style route.
  • Adds tests for category scoring, mismatch handling, trace summary, and invalid-evidence gating.
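
As a rough illustration of what category positive-match selection with evidence gating could look like, here is a minimal sketch. Everything in it is hypothetical: the `Memory` type, the category labels, the reason strings, and both function names are invented for this example and are not taken from `category_rerank.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    text: str
    category: str   # hypothetical category label from a catalog
    score: float    # base retrieval score


def category_rerank(memories, task_category, top_k=2):
    """Keep only memories whose category matches the task, best score first.

    Returns the selected memories plus a trace of (text, reason) pairs so
    that every keep/drop decision carries a diagnostic reason.
    """
    matches = []
    trace = []
    for memory in memories:
        if memory.category == task_category:
            matches.append(memory)
            trace.append((memory.text, "positive_match"))
        else:
            trace.append((memory.text, "category_mismatch"))
    matches.sort(key=lambda memory: memory.score, reverse=True)
    return matches[:top_k], trace


def evidence_is_valid(selected, trace):
    """Gate a run: require a positive match and non-empty injected text."""
    has_positive = any(reason == "positive_match" for _, reason in trace)
    has_injected = any(memory.text.strip() for memory in selected)
    return has_positive and has_injected


memories = [
    Memory("refund flow", "retail_refund", 0.6),
    Memory("rebook flight", "airline_change", 0.9),
    Memory("refund policy check", "retail_refund", 0.8),
]
selected, trace = category_rerank(memories, "retail_refund")
print([memory.text for memory in selected], evidence_is_valid(selected, trace))
```

In this sketch the highest-scoring memory overall ("rebook flight") is dropped because its category does not match, which is the behaviour the mismatch-handling tests would need to cover. A run whose task category matches nothing yields no selection and fails the evidence gate.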

Latest Eval Read

Current full8 OV-runner result for this branch:

| Route | Retail | Airline | Task-weighted total | DB match | Status |
| --- | --- | --- | --- | --- | --- |
| PR-C custom S84 scope + category positive-match | 0.82188 | 0.78750 | 0.81042 | 0.81875 | 16/16 cells complete, runtime evidence valid |

Important boundary:

  • The first 4 repeats were strong: total 0.83750, roughly matching the S89 target anchor.
  • The full8 result regressed in repeats 5-8, so this is not yet a stable uplift claim.
  • Positive case: airline task 19 improved strongly under category + pre-write.
  • Regression cases to inspect next: retail task 49 and airline task 35.

So the current read is: the chain is complete and reviewable, but the policy still needs first-user budget / category gating work before it can be presented as stable outcome improvement.
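
To put a rough number on the regression: if the first-4 total (0.83750) and the full8 total (0.81042) are the same task-weighted metric and each repeat contributes equally, the implied average for repeats 5-8 follows directly. Both conditions are assumptions, and `implied_second_half_mean` is a hypothetical helper; the per-repeat scores are not listed in this PR.

```python
def implied_second_half_mean(first_half_mean: float, full_mean: float) -> float:
    """Back out the second-half mean from the first-half and overall means.

    With two equally sized halves: full = (first + second) / 2,
    so second = 2 * full - first.
    """
    return 2 * full_mean - first_half_mean


first4_total = 0.83750   # reported total for repeats 1-4
full8_total = 0.81042    # reported task-weighted total for all 8 repeats

second_half = implied_second_half_mean(first4_total, full8_total)
print(f"implied repeats 5-8 mean: {second_half:.5f}")
print(f"drop vs repeats 1-4: {first4_total - second_half:.5f}")
```

Under those assumptions, repeats 5-8 average about 0.783, roughly 0.054 below the first four.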

Validation

  • git diff --check
  • python3 -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py
  • uvx ruff format --check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py
  • uvx ruff check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py

Pytest note: direct local pytest tests/benchmark/test_tau2_category_rerank.py is blocked in this checkout by the bundled AGFS extension import path unless the project is installed with its native extension. CI should run the repository-native environment.

huangruiteng and others added 30 commits May 13, 2026 17:43
…u2-category-rerank-on-pr-b

# Conflicts:
#	benchmark/tau2/scripts/run_memory_v2_eval.py
…codex/tau2-category-rerank-on-pr-b

# Conflicts:
#	benchmark/tau2/scripts/run_eval.py
#	benchmark/tau2/scripts/run_memory_v2_eval.py
…erank-on-pr-b

# Conflicts:
#	benchmark/tau2/scripts/run_eval.py
#	benchmark/tau2/scripts/run_memory_v2_eval.py
# Conflicts:
#	benchmark/tau2/README.md
#	benchmark/tau2/scripts/run_eval.py
#	benchmark/tau2/scripts/run_memory_v2_eval.py
@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@huangruiteng

Superseded by #2079. I renamed the branch to follow the repository convention () and cleaned the PR scope so category rerank stays on the OpenViking-native trajectory-view route, without external procedure corpus support.

@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 15, 2026