bench(tau2): add category-aware memory rerank treatment #2044
Closed
huangruiteng wants to merge 51 commits into
…codex/tau2-category-rerank-on-pr-b
…codex/tau2-category-rerank-on-pr-b
…u2-category-rerank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_memory_v2_eval.py
This reverts commit dc12e32.
…codex/tau2-category-rerank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py
…erank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py
# Conflicts: # benchmark/tau2/README.md # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py
Superseded by #2079. I renamed the branch to follow the repository convention () and cleaned the PR scope so category rerank stays on the OpenViking-native trajectory-view route, without external procedure corpus support.
Status
Draft follow-up to #2017. This branch is stacked on PR-B and is meant for code review / experiment discussion, not for immediate merge into `main`.

For the incremental diff against PR-B, review:
huangruiteng/OpenViking@feat/tau2-trajectory-memory...codex/tau2-category-rerank-on-pr-b
Please do not read this as a standalone productized OpenViking reranker yet. The goal is to make the Harness-proven `pre-write + category` route concrete enough for review: config, runtime trace, coverage gates, and a runnable TAU-2 treatment path.

Memory V2 Reference
This stacked branch has not rerun a fresh Memory V2 baseline. The relevant reference is the Harness reasoning-high S87/S88/S89 comparison:
This is why PR-C focuses on category-aware exposure rather than just adding more Memory V2 context: the current MemV2 route was not merely neutral; it was lower than no-memory in the S89-series comparison. The useful signal came from finer-grained procedure / trajectory context plus pre-write and category selection.
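To make the idea of category selection concrete for reviewers, here is a minimal sketch of a category-aware rerank step. It is an illustration only, not the code in `category_rerank.py`: the `MemoryCandidate` type, the `boost` and `top_k` parameters, and the category labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryCandidate:
    """Hypothetical retrieved memory item; not an OpenViking type."""
    text: str
    category: str
    score: float  # base retrieval score, higher is better


def category_rerank(candidates, task_category, boost=0.2, top_k=3):
    """Add a fixed boost to candidates whose category matches the task,
    sort by the adjusted score (descending), and keep the first top_k."""
    def adjusted(candidate):
        bonus = boost if candidate.category == task_category else 0.0
        return candidate.score + bonus

    ranked = sorted(candidates, key=adjusted, reverse=True)
    return ranked[:top_k]


# Illustrative data only; the categories are made up for this sketch.
candidates = [
    MemoryCandidate("refund policy", "refund", 0.50),
    MemoryCandidate("baggage rule", "baggage", 0.60),
    MemoryCandidate("refund steps", "refund", 0.45),
]
selected = category_rerank(candidates, "refund", boost=0.2, top_k=2)
```

With the sample data, both matching candidates move ahead of the higher-scored non-matching one, which is the exposure effect the paragraph above describes.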
What This Adds
- `category_rerank.yaml` and `custom_s84_category.yaml` treatments.
- `custom_s84_category_first_user.yaml` as a next diagnostic config for applying category positive-match at both first-user and pre-write nodes.

Latest Eval Read
Current full8 OV-runner result for this branch:
Important boundary:
0.83750, roughly matching the S89 target anchor.

So the current read is: the chain is complete and reviewable, but the policy still needs first-user budget / category gating work before it can be presented as stable outcome improvement.
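One way to picture the remaining "first-user budget / category gating" work is a small gate that decides whether a memory item may be exposed at a given node. The sketch below is an assumption for discussion, not the policy in this branch: the node names, the token-based budget, and the function signature are hypothetical.

```python
ALLOWED_NODES = {"first_user", "pre_write"}  # hypothetical node names


def should_inject(node, task_category, memory_category,
                  memory_tokens, remaining_budget):
    """Return True only when every gate passes.

    Gates, in order:
      1. the node is one where memory exposure is allowed,
      2. the memory category positively matches the task category,
      3. the memory fits in the remaining token budget for that node.
    """
    if node not in ALLOWED_NODES:
        return False
    if memory_category != task_category:
        return False
    return memory_tokens <= remaining_budget
```

Keeping the gates as separate early returns makes it straightforward to log which gate rejected an item, which may help when reading a runtime trace.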
Validation
- `git diff --check`
- `python3 -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py`
- `uvx ruff format --check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py`
- `uvx ruff check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/category_rerank.py`

Pytest note: direct local `pytest tests/benchmark/test_tau2_category_rerank.py` is blocked in this checkout by the bundled AGFS extension import path unless the project is installed with its native extension. CI should run the repository-native environment.
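For anyone who wants to rerun the validation commands in one go, the following is a minimal sketch of a local helper. It is not part of this branch: `validation_commands` and `run_all` are hypothetical names, and the script assumes `git`, `python3`, and `uvx` are on `PATH` and that it is run from the repository root.

```python
import subprocess

# The three scripts named in the validation commands above.
FILES = [
    "benchmark/tau2/scripts/run_eval.py",
    "benchmark/tau2/scripts/run_memory_v2_eval.py",
    "benchmark/tau2/scripts/category_rerank.py",
]


def validation_commands(files=FILES):
    """Build the validation commands as argument lists (no shell needed)."""
    return [
        ["git", "diff", "--check"],
        ["python3", "-m", "py_compile", *files],
        ["uvx", "ruff", "format", "--check", *files],
        ["uvx", "ruff", "check", *files],
    ]


def run_all(commands):
    """Run each command in order; return the first failing one, or None."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return cmd
    return None
```

Calling `run_all(validation_commands())` stops at the first non-zero exit code and returns that command, so the failing step is easy to identify. The pytest step is left out on purpose because of the AGFS import issue noted above.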