bench(tau2): add category-aware trajectory memory rerank #2079
Draft
huangruiteng wants to merge 53 commits into
…codex/tau2-category-rerank-on-pr-b
…codex/tau2-category-rerank-on-pr-b
…u2-category-rerank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_memory_v2_eval.py
This reverts commit dc12e32.
…codex/tau2-category-rerank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py
…erank-on-pr-b # Conflicts: # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py
# Conflicts: # benchmark/tau2/README.md # benchmark/tau2/scripts/run_eval.py # benchmark/tau2/scripts/run_memory_v2_eval.py
Description
Follow-up to #2017. This PR adds an opt-in TAU-2 benchmark treatment for category-aware reranking on top of OpenViking trajectory-view memory.
It keeps the benchmark path OpenViking-native:
- train_memory_mode: experience_only;
- memories/trajectories;
- The benchmark trains/evaluates its own OpenViking corpus through the config; it does not require a pre-existing external corpus.
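For readers unfamiliar with the idea, here is a minimal sketch of what a category-aware rerank over retrieved trajectory memories could look like. It is not the code in this PR: the `Candidate` shape, the `category_rerank` function, the `boost` and `top_k` defaults, and the example URIs and category labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A retrieved trajectory memory (hypothetical shape, not the PR's schema)."""
    uri: str
    score: float    # base retrieval score, higher is better
    category: str   # category label, e.g. from a sidecar annotation


def category_rerank(candidates, query_category, boost=0.2, top_k=3):
    """Add a fixed bonus to candidates whose category matches the query's.

    `boost` and `top_k` are illustrative defaults, not values from the PR.
    """
    def adjusted(c):
        return c.score + (boost if c.category == query_category else 0.0)

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(candidates, key=adjusted, reverse=True)[:top_k]


# Hypothetical retrieval results; URIs and categories are made up.
hits = [
    Candidate("memories/trajectories/a", 0.80, "refund"),
    Candidate("memories/trajectories/b", 0.75, "baggage"),
    Candidate("memories/trajectories/c", 0.70, "baggage"),
]
reranked = category_rerank(hits, query_category="baggage", top_k=2)
print([c.uri for c in reranked])
```

In this sketch a lower-scored memory can overtake a higher-scored one only when its category matches the query, so behavior without category labels is just the base retrieval order.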
Status
Local TAU-2 evidence is from reasoning-high, fixed-first-user, 8-repeat runs. The headline is that trajectory-view + category rerank gives a clear airline lift, while the no-category pre-write route is currently stronger on retail.
Ablation context:
This PR is still a draft because category-sidecar generation and the final product boundary are intentionally kept under review. The benchmark code path itself is now scoped to the OpenViking-native trajectory route.
Type of Change
Verification
.venv/bin/python -m pytest -o addopts= tests/benchmark/test_tau2_category_rerank.py
python3 benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/category_rerank.yaml --plan-only --run-id prc_cleanup_plan_check2
python3 -m py_compile benchmark/tau2/scripts/category_rerank.py benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/build_category_catalog.py benchmark/tau2/scripts/build_category_requests.py benchmark/tau2/scripts/generate_category_annotations.py benchmark/tau2/scripts/run_category_annotation_batches.py
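The `py_compile` syntax check can also be driven from Python when a per-file report is more convenient than a single exit code. The sketch below uses the standard-library `py_compile.compile` with `doraise=True`; the helper name `check_sources` is hypothetical and not part of this PR.

```python
import py_compile


def check_sources(paths):
    """Byte-compile each path and return {path: error message or None}."""
    results = {}
    for path in paths:
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(path, doraise=True)
            results[path] = None
        except py_compile.PyCompileError as exc:
            results[path] = exc.msg
        except OSError as exc:
            # Missing or unreadable file.
            results[path] = str(exc)
    return results
```

Passing the seven script paths from the `py_compile` command to `check_sources` should yield `None` for every entry when the files compile cleanly.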