feat(memory): upgrade trajectory extraction to beat no-memory baseline #2017

Merged
yangxinxin-7 merged 22 commits into volcengine:main from huangruiteng:feat/tau2-trajectory-memory on May 21, 2026

@huangruiteng
Contributor

@huangruiteng huangruiteng commented May 13, 2026

Status

Follow-up to #2003. This PR upgrades OpenViking's default trajectory extraction protocol: instead of leaving a training trajectory as a raw conversation-shaped record, Memory V2 now extracts compact procedure memories with operation intent, applicability boundaries, write-field provenance, and anti-patterns.

TAU-2 is the validation battleground for this protocol change, not the whole scope of the idea. The broader goal is to make future agent trajectories reusable as decision-time procedure memory across tasks; TAU-2 gives us a strict held-out tool-use benchmark to prove the memory shape is useful.

Headline: on full TAU-2 8-trial evaluation over retail + airline, the optimized trajectory-memory route clearly beats the no-memory baseline: 0.85313 vs 0.80156 domain-avg reward (+0.05157; retail 0.89375 / airline 0.81250). The complete content-shape matrix establishes the same route at 0.84688 before the latest fixed-protocol recheck, so the PR now reports 0.85313 as the headline best score.

| Route | Retail reward | Airline reward | Domain-avg reward | Delta vs paired no-memory | Read |
| --- | --- | --- | --- | --- | --- |
| No-memory reference | 0.84688 | 0.75625 | 0.80156 | - | Same held-out 8-trial protocol; airline uses the refreshed PR-B no-memory rerun. |
| Legacy/default trajectory memory, top4, first-user + pre-write + scope | 0.84375 | 0.74688 | 0.79531 | +0.00156 | Same route as the optimized headline, but with the older Goal / Trajectory / Result / Fail reason prompt; trajectory recall is effectively flat. |
| Legacy/default experience memory, best experience route | 0.83125 | 0.79375 | 0.81250 | +0.01875 | Strongest old-prompt experience control: 4000-char pre-write-only; useful but still below optimized trajectory memory. |
| Optimized trajectory memory, top4, first-user + pre-write + scope | 0.89375 | 0.81250 | 0.85313 | +0.05157 | Current strongest PR-B evidence; fixed protocol, scope enabled, no category/selector/controller/failure variant. |
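
The domain-avg and delta columns can be recomputed from the per-domain rewards. The following is a minimal sketch, assuming the domain average is the unweighted mean of the retail and airline rewards (an assumption that matches the reported figures); `domain_avg` is an illustrative helper, not a function from this PR.

```python
def domain_avg(retail: float, airline: float) -> float:
    """Unweighted mean of the two per-domain rewards (assumed definition)."""
    return (retail + airline) / 2


# Per-domain rewards copied from the headline table.
no_memory = domain_avg(0.84688, 0.75625)
optimized = domain_avg(0.89375, 0.81250)
delta = optimized - no_memory

print(f"no-memory {no_memory:.5f}, optimized {optimized:.5f}, delta {delta:+.5f}")
```

The recomputed values agree with the table to within rounding in the fifth decimal place, since the table's delta appears to be taken between already-rounded averages.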

A legacy/default prompt control now makes the mechanism clearer: when the same success-only corpus setup is rebuilt with the older Goal / Trajectory / Result / Fail reason trajectory prompt, the same trajectory top4 first-user + pre-write route reaches only 0.79531 domain-avg reward. The old-prompt experience route is stronger than old trajectory recall, with a best experience control at 0.81250, but it still trails the optimized procedure-shaped trajectory route. The main lift is therefore not explained by success-only training, transcript replay format, or a memory budget alone.

Claim boundary: these are TAU-2 held-out full8 reward scores: full test split, 8 repeats, retail + airline, reasoning-high, fixed first user where applicable. The headline content-shape matrix completed with 144/144 valid cells and no nonzero return codes; the latest headline-route recheck completed 16/16 valid cells. The legacy/default prompt control completed with 208/208 valid cells. Category rerank and other diagnostic controls stay outside this PR claim.
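
The valid-cell counts are consistent with one cell per (strategy, domain, repeat) combination. A minimal sketch under that assumption: the 9-strategy figure comes from the content-shape evidence grid described in this PR, while the 13-strategy figure for the legacy control is inferred from the number of rows in the legacy/default comparison table, not stated directly.

```python
DOMAINS = ("retail", "airline")
REPEATS = 8  # "full8": eight repeats per task set


def planned_cells(num_strategies: int) -> int:
    """Cells in a matrix, assuming one cell per (strategy, domain, repeat)."""
    return num_strategies * len(DOMAINS) * REPEATS


content_matrix = planned_cells(9)    # content-shape evidence grid
legacy_control = planned_cells(13)   # inferred strategy count (assumption)
headline_recheck = planned_cells(1)  # single headline route

print(content_matrix, legacy_control, headline_recheck)
```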

Known follow-up: the lifecycle label is intentionally coarse. It is useful in this PR as a lifecycle boundary, but a more general trajectory protocol should eventually split it into orthogonal fields such as effect type, object lifecycle, terminality, and failure / anti-pattern type, then validate that broader schema beyond TAU-2 before changing the default.

What Changed

  • Refines the trajectory extraction prompt into a procedure-memory protocol: operation intent, preconditions, immutable-object boundary, procedure, write-field provenance, anti-patterns, applicability, and negative applicability.
  • Uses coarse lifecycle labels as lifecycle boundaries rather than benchmark task categories; their job is to keep one memory from blending unrelated reads, writes, handoffs, and final responses.
  • Allows a session to produce multiple reusable trajectory memories when it contains separate intents or lifecycle transitions, instead of forcing one whole-session memory.
  • Keeps the memory generally reusable: concrete IDs, payment details, exact dates, and case-specific values should become evidence only when needed, not the trigger for future recall.
  • Adds TAU-2 runner support for:
    • trajectory vs experience retrieval selection;
    • first-user, pre-write, and combined retrieval nodes;
    • scope prompt files;
    • reusable corpus preparation with corpus identity checks;
    • fixed-count injection and optional injected-memory character budgets;
    • safe corpus reuse across experience/trajectory buckets by rebuilding the search URI from the current strategy memory type.
  • Adds benchmark/tau2/config/prb_content_matrix_new_prompt.yaml so reviewers can reproduce the no-memory control, trajectory top4, experience top2, and 4000-character budget ablation from one config.
  • Removes diagnostic controls from the PR-B benchmark path; selector, controller, category rerank, and failure-reflection variants are archived for separate experiments.
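
To make the memory shape concrete, the fields listed in the first bullet can be pictured as a record like the one below. This is only an illustrative sketch: the `ProcedureMemory` class, its attribute names and types, the `"write"` label value, and the example content are all hypothetical and are not the schema or prompt output used in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ProcedureMemory:
    """Hypothetical container for one extracted procedure memory."""

    operation_intent: str
    lifecycle_label: str  # coarse lifecycle boundary, not a benchmark task category
    preconditions: list[str] = field(default_factory=list)
    immutable_object_boundary: list[str] = field(default_factory=list)
    procedure: list[str] = field(default_factory=list)
    # Maps each written field to where its value came from.
    write_field_provenance: dict[str, str] = field(default_factory=dict)
    anti_patterns: list[str] = field(default_factory=list)
    applicability: list[str] = field(default_factory=list)
    negative_applicability: list[str] = field(default_factory=list)


# Invented example; note that it carries no concrete IDs, dates, or payment details.
example = ProcedureMemory(
    operation_intent="Update the shipping address on an order that has not shipped",
    lifecycle_label="write",
    preconditions=["User identity verified", "Order is still modifiable"],
    immutable_object_boundary=["Do not touch other orders on the account"],
    procedure=["Look up the order", "Confirm the new address", "Apply the update"],
    write_field_provenance={"address": "stated by the user in this conversation"},
    anti_patterns=["Writing before the user explicitly confirms"],
    applicability=["Address change requested on an unshipped order"],
    negative_applicability=["Order already shipped or delivered"],
)
print(example.operation_intent)
```

A session with several intents or lifecycle transitions would yield several such records rather than one whole-session memory.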

Evidence

All numbers below are TAU-2 held-out test reward scores: 8 repeats, retail + airline, fixed first user where applicable, reasoning-high. DB-match scoreboards are preserved in the run artifacts, but reward is the primary metric.

Main no-category trajectory routes

| Route | Retail | Airline | Domain avg | Delta vs no-memory | Notes |
| --- | --- | --- | --- | --- | --- |
| No-memory reference | 0.84688 | 0.75625 | 0.80156 | - | Baseline for this PR description. |
| Trajectory top4, first-user + scope | 0.88750 | 0.78125 | 0.83438 | +0.03281 | Strong retail lift and positive airline lift. |
| Trajectory top4, pre-write-only + scope | 0.87500 | 0.78125 | 0.82813 | +0.02656 | Positive two-domain average. |
| Trajectory top4, first-user + pre-write + scope, headline recheck | 0.89375 | 0.81250 | 0.85313 | +0.05157 | Current strongest PR-B evidence. |

Content-shape ablation

| Shape | Best route | Retail | Airline | Domain avg | Read |
| --- | --- | --- | --- | --- | --- |
| Trajectory fixed top4 | first-user + pre-write | 0.88750 | 0.80625 | 0.84688 | Headline candidate. |
| Experience fixed top2 | first-user + pre-write | 0.88438 | 0.76250 | 0.82344 | Retail helps; airline is near baseline. |
| Trajectory 4000-char | pre-write-only | 0.85938 | 0.77500 | 0.81719 | Budget usually injects only one trajectory. |
| Experience 4000-char | first-user + pre-write | 0.85625 | 0.76875 | 0.81250 | Weaker than fixed-count; avg injected is about 1.3-2.0 memories depending on node/domain. |

The content-shape matrix completed with 144/144 valid cells and 0 nonzero return codes; the PR config now reproduces that same 9-strategy evidence grid. The result supports fixed-count procedure-shaped trajectory memory as the PR-B default; experience and 4000-character budget variants remain useful ablations, but not the headline route.
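
The gap between fixed-count and budgeted injection is easier to see with a toy selector. The sketch below is an assumption about how a character budget might interact with a top-k cap; `select_for_injection`, its stop-at-first-overflow rule, and the memory lengths are all hypothetical and do not describe the runner's actual implementation.

```python
from typing import Optional


def select_for_injection(
    ranked: list[str], top_k: int, char_budget: Optional[int] = None
) -> list[str]:
    """Take up to top_k ranked memories, stopping once the budget would overflow."""
    selected: list[str] = []
    used = 0
    for text in ranked[:top_k]:
        if char_budget is not None and used + len(text) > char_budget:
            break  # assumed policy: stop at the first memory that does not fit
        selected.append(text)
        used += len(text)
    return selected


# Invented memory lengths, ranked best-first.
ranked = ["a" * 3000, "b" * 2500, "c" * 1800, "d" * 900]

fixed = select_for_injection(ranked, top_k=4)
budgeted = select_for_injection(ranked, top_k=4, char_budget=4000)
print(len(fixed), len(budgeted))
```

With long procedure-shaped memories, a 4000-character budget of this kind leaves room for roughly one trajectory, which is one plausible reading of the note that the budget usually injects only one trajectory.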

Legacy/default prompt control

This control keeps the same success-only and TAU-2 replay setup, but uses the older trajectory prompt shape instead of the procedure/action-boundary protocol. It is included to separate the prompt/protocol contribution from corpus hygiene and budget effects.

The no-memory rows come from each matrix's own paired control, so the most useful read is both the absolute domain average and the delta against that matrix's no-memory baseline.

| Shape | Route | Legacy/default prompt avg | Legacy delta | Procedure-shaped prompt avg | Procedure delta | New - legacy | Read |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No-memory | - | 0.79375 | - | 0.80156 | - | +0.00781 | Separate paired controls; not the claimed prompt effect. |
| Trajectory fixed top4 | first-user | 0.79688 | +0.00313 | 0.83437 | +0.03281 | +0.03749 | Procedure shape makes first-turn recall useful. |
| Trajectory fixed top4 | pre-write-only | 0.80781 | +0.01406 | 0.82812 | +0.02656 | +0.02031 | Both positive; procedure shape is stronger. |
| Trajectory fixed top4 | first-user + pre-write | 0.79531 | +0.00156 | 0.84688 | +0.04531 | +0.05157 | Headline PR-B route. |
| Experience fixed top2 | first-user | 0.80312 | +0.00938 | 0.81406 | +0.01250 | +0.01094 | Small positive control; not headline. |
| Experience fixed top2 | pre-write-only | 0.79844 | +0.00469 | 0.82031 | +0.01875 | +0.02187 | Procedure branch is stronger. |
| Experience fixed top2 | first-user + pre-write | 0.78125 | -0.01250 | 0.82344 | +0.02188 | +0.04219 | Legacy combo hurts; procedure combo helps. |
| Trajectory 4000-char | first-user | 0.79375 | +0.00000 | 0.81563 | +0.01406 | +0.02188 | Budgeted first-user remains weaker than fixed top4. |
| Trajectory 4000-char | pre-write-only | 0.82344 | +0.02969 | 0.81719 | +0.01562 | -0.00625 | Best legacy route, but still below procedure fixed top4 combo. |
| Trajectory 4000-char | first-user + pre-write | 0.81094 | +0.01719 | 0.82031 | +0.01875 | +0.00937 | Both positive but below headline. |
| Experience 4000-char | first-user | 0.77812 | -0.01562 | 0.80000 | -0.00156 | +0.02188 | Neither should be headline. |
| Experience 4000-char | pre-write-only | 0.81250 | +0.01875 | 0.80625 | +0.00469 | -0.00625 | Legacy budget has a small edge here, but route is not best overall. |
| Experience 4000-char | first-user + pre-write | 0.77969 | -0.01406 | 0.81250 | +0.01094 | +0.03281 | Procedure branch avoids the legacy combo regression. |

The legacy/default prompt control completed with 208/208 valid cells. The complete comparison is intentionally not one-sided: legacy 4000-character pre-write has a couple of small wins, but the fixed-count trajectory routes, especially first-user + pre-write, are much stronger with the procedure/action-boundary extraction protocol. This supports keeping the extraction-protocol change as the center of the PR rather than treating the result as a corpus-only or budget-only effect.
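
The three delta columns in the comparison are simple differences against each matrix's own paired no-memory control. A minimal sketch, using the no-memory averages reported for the two matrices; `paired_read` is an illustrative helper and not part of the PR.

```python
# Paired no-memory domain averages, one per matrix.
LEGACY_NO_MEMORY = 0.79375
PROCEDURE_NO_MEMORY = 0.80156


def paired_read(legacy_avg: float, procedure_avg: float) -> dict[str, float]:
    """Deltas against each matrix's own control, plus the direct prompt gap."""
    return {
        "legacy_delta": legacy_avg - LEGACY_NO_MEMORY,
        "procedure_delta": procedure_avg - PROCEDURE_NO_MEMORY,
        "new_minus_legacy": procedure_avg - legacy_avg,
    }


# Trajectory fixed top4, first-user + pre-write (headline route).
headline = paired_read(0.79531, 0.84688)
# Trajectory 4000-char, pre-write-only (best legacy route).
budgeted = paired_read(0.82344, 0.81719)

for name, row in (("headline", headline), ("budgeted", budgeted)):
    print(name, {key: f"{value:+.5f}" for key, value in row.items()})
```

Because the inputs here are already rounded to five decimals, the recomputed deltas can differ from the table by one unit in the last digit.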

Success + failure outcome-label ablation

This is a risk check for reviewers who worry that the new extraction prompt might only be safe on successful trajectories. I rebuilt success+failure corpora where failed training sessions carry only a minimal outcome label, not TAU-2 reward assertions or evaluator feedback. The result does not beat the success-only headline, so this PR keeps the default corpus success-only and leaves failure memories for a separate anti-pattern / root-cause compression design.

| Prompt / corpus | Route | Retail | Airline | Domain avg | Read |
| --- | --- | --- | --- | --- | --- |
| Optimized trajectory prompt, success-only | trajectory top4, first-user + pre-write | 0.89375 | 0.81250 | 0.85313 | PR-B headline; fixed protocol, 16/16 valid cells. |
| Optimized trajectory prompt, success+failure label-only | trajectory top4, first-user + pre-write | 0.85000 | 0.76875 | 0.80938 | Fixed risk check, 32/32 valid cells; below success-only. |
| Optimized trajectory prompt, success+failure label-only | experience top2, first-user | 0.84688 | 0.71875 | 0.78281 | Fixed risk check; experience-only also does not recover the gap. |
| Legacy/default prompt, success+failure label-only | trajectory top4, first-user + pre-write | 0.81875 | 0.72500 | 0.77188 | Diagnostic old-prompt control; lower than the procedure-shaped prompt. |
| Legacy/default prompt, success+failure label-only | experience top2, first-user | 0.81250 | 0.70625 | 0.75938 | Diagnostic old-prompt control; kept as a failure-corpus sanity check. |

Reproduce

After TAU-2, model credentials, and a clean local OpenViking service are configured, run a tiny wiring smoke:

benchmark/tau2/run_full_eval.sh \
  --config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
  --domain retail \
  --strategy-id new_traj_fixed_first_user_prewrite \
  --num-tasks 1 \
  --train-num-tasks 1 \
  --repeat-count 1 \
  --strict-preflight \
  --execute

Run the full PR-B evidence matrix:

benchmark/tau2/run_full_eval.sh \
  --config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
  --run-id prb_content_matrix_new_prompt_full8 \
  --strict-preflight \
  --execute

The main aggregate is benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json; per-cell details are in cell_results/, and corpus identity / generated memory checks are in memory_corpora/.

Validation

  • git diff --check
  • sensitive-token coarse scan over the staged diff
  • /Users/bytedance/Documents/OpenViking-pr-b-trajectory/.venv/bin/python -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py openviking/session/memory/agent_trajectory_context_provider.py
  • Plan-only check for benchmark/tau2/config/prb_content_matrix_new_prompt.yaml: 144 cells planned, 144 executable, char-budget flags emitted for the 4000-character variants.
  • Fresh airline no-memory full8 baseline: reward 0.75625.
  • PR-B content-shape matrix: 144/144 cells valid, all returncode 0.
  • Legacy/default prompt strict control: 208/208 cells valid; best route 0.82344 domain-avg reward, below the procedure-shaped headline 0.84688.
  • Latest headline-route recheck: 16/16 cells valid; retail 0.89375, airline 0.81250, domain-avg reward 0.85313.
  • Success+failure outcome-label-only risk checks: see the ablation table above; the fixed strongest-route check completed 32/32 cells valid and stayed below the success-only headline.

Thanks @yangxinxin-7 for the TAU-2 benchmark scaffold in #2003; this PR keeps using that workflow rather than adding a separate eval path.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

2003 - Fully compliant

Compliant requirements:

  • Add TAU-2 trajectory config
  • Add search_memory_type parameter and validation
  • Refine trajectory extraction instruction and template
  • Update prompts to use neutral wording
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@huangruiteng huangruiteng force-pushed the feat/tau2-trajectory-memory branch from 9cfe362 to c2228a2 on May 13, 2026 16:12
@huangruiteng huangruiteng force-pushed the feat/tau2-trajectory-memory branch from c2228a2 to 2b767f2 on May 13, 2026 17:03
@huangruiteng huangruiteng changed the title from "feat(memory): add TAU-2 trajectory-view treatment" to "feat(memory): extract TAU-2 trajectories into procedure memories" on May 20, 2026
@huangruiteng huangruiteng changed the title from "feat(memory): extract TAU-2 trajectories into procedure memories" to "feat(memory): upgrade trajectory extraction to beat TAU-2 no-memory" on May 20, 2026
@huangruiteng huangruiteng changed the title from "feat(memory): upgrade trajectory extraction to beat TAU-2 no-memory" to "feat(memory): upgrade trajectory extraction to beat no-memory baseline" on May 20, 2026
@yangxinxin-7 yangxinxin-7 merged commit 520713c into volcengine:main May 21, 2026
5 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 21, 2026
@huangruiteng
Contributor Author

Follow-up PR for the scope-prompt fairness read: #2172.

Why this exists: the strongest PR-B result used the TAU-2 domain scope prompt, and a same-scope no-memory rerun showed that the scope prompt itself can move the baseline. #2172 adds the missing no-memory scope wiring plus a generic advisory-memory scope prompt/config so the PR-B uplift can be read in two clean ways:

  • domain-specific scope: headline avg 0.85313 vs same-scope no-memory 0.81719, conservative delta +0.03594
  • generic scope: trajectory memory avg 0.84219 vs generic-scope no-memory 0.79844, delta +0.04375

This does not change the main PR-B mechanism; it makes the attribution boundary clearer and gives us a benchmark-neutral scope option.

@huangruiteng
Contributor Author

Follow-up after the PR-B merge: opened #2255 to make trajectory retrieval use a compact trajectory_name + retrieval_anchor embedding surface.
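
As a rough illustration of what a compact embedding surface could look like, the sketch below builds the text to embed from only the two named fields and ignores the full memory body. The `embedding_surface` helper, the newline join, and the example record are assumptions for illustration; the actual change lives in #2255 and may differ.

```python
def embedding_surface(trajectory_name: str, retrieval_anchor: str) -> str:
    """Join the two short fields into the text that would be embedded."""
    parts = [part.strip() for part in (trajectory_name, retrieval_anchor)]
    return "\n".join(part for part in parts if part)


# Invented record: only the name and anchor feed the vector, not the body.
memory = {
    "trajectory_name": "Change shipping address before shipment",
    "retrieval_anchor": "User wants a different delivery address on an unshipped order",
    "body": "Full procedure, provenance, and anti-patterns would live here.",
}

surface = embedding_surface(memory["trajectory_name"], memory["retrieval_anchor"])
print(surface)
```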

Clean TAU-2 full8 read, same fixed-first-user + generic scope + trajectory top4 + pre-write top2 protocol:

| Setting | Retail | Airline | Task-weighted reward |
| --- | --- | --- | --- |
| current master trajectory prompt | 0.83750 | 0.74375 | 0.80625 |
| pre-#2221 trajectory prompt | 0.86250 | 0.75625 | 0.82708 |
| name + retrieval anchor (#2255) | 0.88438 | 0.78750 | 0.85208 |

This keeps the change narrow: no category rerank, controller, failure overlay, or runner filter; just a better positive applicability text for trajectory memory vector search.
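
The task-weighted column differs from the unweighted domain average used earlier in this thread. The reported values are consistent with weighting retail twice as heavily as airline; that 2:1 weight is inferred from the numbers in the table and is not stated in the comment, so treat it as an assumption in this sketch.

```python
RETAIL_WEIGHT = 2 / 3  # inferred from the reported rows, not stated in the thread


def task_weighted(retail: float, airline: float, w: float = RETAIL_WEIGHT) -> float:
    """Weighted mean of the two domain rewards."""
    return w * retail + (1 - w) * airline


rows = {
    "current master trajectory prompt": (0.83750, 0.74375),
    "pre-#2221 trajectory prompt": (0.86250, 0.75625),
    "name + retrieval anchor (#2255)": (0.88438, 0.78750),
}

weighted = {name: task_weighted(*pair) for name, pair in rows.items()}
for name, value in weighted.items():
    print(f"{name}: {value:.5f}")
```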
