feat(memory): upgrade trajectory extraction to beat no-memory baseline#2017
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
9cfe362 to
c2228a2
Compare
c2228a2 to
2b767f2
Compare
|
Follow-up PR for the scope-prompt fairness read: #2172. Why this exists: the strongest PR-B result used the TAU-2 domain scope prompt, and a same-scope no-memory rerun showed that the scope prompt itself can move the baseline. #2172 adds the missing no-memory scope wiring plus a generic advisory-memory scope prompt/config so the PR-B uplift can be read in two clean ways:
This does not change the main PR-B mechanism; it makes the attribution boundary clearer and gives us a benchmark-neutral scope option. |
|
Follow-up after the PR-B merge: opened #2255 to make trajectory retrieval use a compact Clean TAU-2 full8 read, same fixed-first-user + generic scope + trajectory top4 + pre-write top2 protocol:
This keeps the change narrow: no category rerank, controller, failure overlay, or runner filter; just a better positive applicability text for trajectory memory vector search. |
Status
Follow-up to #2003. This PR upgrades OpenViking's default trajectory extraction protocol: instead of leaving a training trajectory as a raw conversation-shaped record, Memory V2 now extracts compact procedure memories with operation intent, applicability boundaries, write-field provenance, and anti-patterns.
TAU-2 is the validation battleground for this protocol change, not the whole scope of the idea. The broader goal is to make future agent trajectories reusable as decision-time procedure memory across tasks; TAU-2 gives us a strict held-out tool-use benchmark to prove the memory shape is useful.
Headline: on full TAU-2 8-trial evaluation over
retail + airline, the optimized trajectory-memory route clearly beats the no-memory baseline: 0.85313 vs 0.80156 domain-avg reward (+0.05157; retail 0.89375 / airline 0.81250). The complete content-shape matrix establishes the same route at 0.84688 before the latest fixed-protocol recheck, so the PR now reports 0.85313 as the headline best score.Goal / Trajectory / Result / Fail reasonprompt; trajectory recall is effectively flat.A legacy/default prompt control now makes the mechanism clearer: when the same success-only corpus setup is rebuilt with the older
Goal / Trajectory / Result / Fail reasontrajectory prompt, the same trajectory top4 first-user + pre-write route reaches only 0.79531 domain-avg reward. The old-prompt experience route is stronger than old trajectory recall, with a best experience control at 0.81250, but it still trails the optimized procedure-shaped trajectory route. The main lift is therefore not explained by success-only training, transcript replay format, or a memory budget alone.Claim boundary: these are TAU-2 held-out full8 reward scores: full test split, 8 repeats,
retail + airline, reasoning-high, fixed first user where applicable. The headline content-shape matrix completed with 144/144 valid cells and no nonzero return codes; the latest headline-route recheck completed 16/16 valid cells. The legacy/default prompt control completed with 208/208 valid cells. Category rerank and other diagnostic controls stay outside this PR claim.Known follow-up: the coarse lifecycle label is intentionally coarse. It is useful in this PR as a lifecycle boundary, but a more general trajectory protocol should eventually split it into orthogonal fields such as effect type, object lifecycle, terminality, and failure / anti-pattern type, then validate that broader schema beyond TAU-2 before changing the default.
What Changed
benchmark/tau2/config/prb_content_matrix_new_prompt.yamlso reviewers can reproduce the no-memory control, trajectory top4, experience top2, and 4000-character budget ablation from one config.Evidence
All numbers below are TAU-2 held-out test reward scores: 8 repeats,
retail + airline, fixed first user where applicable, reasoning-high. DB-match scoreboards are preserved in the run artifacts, but reward is the primary metric.Main no-category trajectory routes
Content-shape ablation
The content-shape matrix completed with 144/144 valid cells and 0 nonzero return codes; the PR config now reproduces that same 9-strategy evidence grid. The result supports fixed-count procedure-shaped trajectory memory as the PR-B default; experience and 4000-character budget variants remain useful ablations, but not the headline route.
Legacy/default prompt control
This control keeps the same success-only and TAU-2 replay setup, but uses the older trajectory prompt shape instead of the procedure/action-boundary protocol. It is included to separate the prompt/protocol contribution from corpus hygiene and budget effects.
The no-memory rows come from each matrix's own paired control, so the most useful read is both the absolute domain average and the delta against that matrix's no-memory baseline.
The legacy/default prompt control completed with 208/208 valid cells. The complete comparison is intentionally not one-sided: legacy 4000-character pre-write has a couple of small wins, but the fixed-count trajectory routes, especially first-user + pre-write, are much stronger with the procedure/action-boundary extraction protocol. This supports keeping the extraction-protocol change as the center of the PR rather than treating the result as a corpus-only or budget-only effect.
Success + failure outcome-label ablation
This is a risk check for reviewers who worry that the new extraction prompt might only be safe on successful trajectories. I rebuilt success+failure corpora where failed training sessions carry only a minimal outcome label, not TAU-2 reward assertions or evaluator feedback. The result does not beat the success-only headline, so this PR keeps the default corpus success-only and leaves failure memories for a separate anti-pattern / root-cause compression design.
Reproduce
After TAU-2, model credentials, and a clean local OpenViking service are configured, run a tiny wiring smoke:
Run the full PR-B evidence matrix:
The main aggregate is
benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json; per-cell details are incell_results/, and corpus identity / generated memory checks are inmemory_corpora/.Validation
git diff --check/Users/bytedance/Documents/OpenViking-pr-b-trajectory/.venv/bin/python -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py openviking/session/memory/agent_trajectory_context_provider.pybenchmark/tau2/config/prb_content_matrix_new_prompt.yaml: 144 cells planned, 144 executable, char-budget flags emitted for the 4000-character variants.0.75625.0.82344domain-avg reward, below the procedure-shaped headline0.84688.0.89375, airline0.81250, domain-avg reward0.85313.Thanks @yangxinxin-7 for the TAU-2 benchmark scaffold in #2003; this PR keeps using that workflow rather than adding a separate eval path.