feat(memory): upgrade trajectory extraction to beat no-memory baseline by huangruiteng · Pull Request #2017 · volcengine/OpenViking

huangruiteng · 2026-05-13T09:56:49Z

Status

Follow-up to #2003. This PR upgrades OpenViking's default trajectory extraction protocol: instead of leaving a training trajectory as a raw conversation-shaped record, Memory V2 now extracts compact procedure memories with operation intent, applicability boundaries, write-field provenance, and anti-patterns.

TAU-2 is the validation battleground for this protocol change, not the whole scope of the idea. The broader goal is to make future agent trajectories reusable as decision-time procedure memory across tasks; TAU-2 gives us a strict held-out tool-use benchmark to prove the memory shape is useful.

Headline: on the full TAU-2 8-trial evaluation over retail + airline, the optimized trajectory-memory route beats the no-memory baseline: 0.85313 vs 0.80156 domain-avg reward (+0.05157; retail 0.89375 / airline 0.81250). The complete content-shape matrix measured the same route at 0.84688; the latest fixed-protocol recheck measured 0.85313, which the PR now reports as the headline score.

| Route | Retail reward | Airline reward | Domain-avg reward | Delta vs paired no-memory | Read |
| --- | --- | --- | --- | --- | --- |
| No-memory reference | 0.84688 | 0.75625 | 0.80156 | - | Same held-out 8-trial protocol; airline uses the refreshed PR-B no-memory rerun. |
| Legacy/default trajectory memory, top4, first-user + pre-write + scope | 0.84375 | 0.74688 | 0.79531 | +0.00156 | Same route as the optimized headline, but with the older `Goal / Trajectory / Result / Fail reason` prompt; trajectory recall is effectively flat. |
| Legacy/default experience memory, best experience route | 0.83125 | 0.79375 | 0.81250 | +0.01875 | Strongest old-prompt experience control: 4000-char pre-write-only; useful but still below optimized trajectory memory. |
| Optimized trajectory memory, top4, first-user + pre-write + scope | 0.89375 | 0.81250 | 0.85313 | +0.05157 | Current strongest PR-B evidence; fixed protocol, scope enabled, no category/selector/controller/failure variant. |
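
The domain-avg and delta columns can be recomputed from the per-domain rewards. The sketch below assumes the domain average is the unweighted mean of retail and airline, which matches the rows above to within rounding; `domain_avg` is an illustrative helper, not a function from this PR.

```python
from statistics import mean


def domain_avg(retail: float, airline: float) -> float:
    """Unweighted mean over the two domains (assumption consistent with the table)."""
    return mean([retail, airline])


# Per-domain rewards copied from the table above.
no_memory = domain_avg(0.84688, 0.75625)
optimized = domain_avg(0.89375, 0.81250)

print(f"no-memory avg: {no_memory:.5f}")
print(f"optimized avg: {optimized:.5f}")
print(f"delta:         {optimized - no_memory:+.5f}")
```

Because the table rounds each per-domain reward to five decimals, recomputed values can differ from the reported ones in the last digit.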

A legacy/default prompt control now makes the mechanism clearer: when the same success-only corpus setup is rebuilt with the older Goal / Trajectory / Result / Fail reason trajectory prompt, the same trajectory top4 first-user + pre-write route reaches only 0.79531 domain-avg reward. The old-prompt experience route is stronger than old trajectory recall, with a best experience control at 0.81250, but it still trails the optimized procedure-shaped trajectory route. The main lift is therefore not explained by success-only training, transcript replay format, or a memory budget alone.

Claim boundary: these are TAU-2 held-out full8 reward scores: full test split, 8 repeats, retail + airline, reasoning-high, fixed first user where applicable. The headline content-shape matrix completed with 144/144 valid cells and no nonzero return codes; the latest headline-route recheck completed 16/16 valid cells. The legacy/default prompt control completed with 208/208 valid cells. Category rerank and other diagnostic controls stay outside this PR claim.
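
The cell counts are consistent with a grid of strategies x domains x repeats. The sketch below assumes one cell is one (strategy, domain, repeat) run; the strategy IDs are placeholders, and only the 9-strategy size of the content-shape grid is stated in this PR.

```python
from itertools import product

DOMAINS = ("retail", "airline")
REPEATS = 8


def plan_cells(strategy_ids):
    """Enumerate (strategy, domain, repeat) cells; one cell per eval run (assumed)."""
    return list(product(strategy_ids, DOMAINS, range(REPEATS)))


# Placeholder strategy IDs; the real IDs live in the benchmark config.
content_matrix = plan_cells([f"strategy_{i}" for i in range(9)])
headline_recheck = plan_cells(["headline_route"])

print(len(content_matrix), len(headline_recheck))  # 144 16
```

By the same arithmetic, 208 cells would correspond to 13 strategies, although the PR does not state that count directly.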

Known follow-up: the lifecycle label is intentionally coarse. It is useful in this PR as a lifecycle boundary, but a more general trajectory protocol should eventually split it into orthogonal fields such as effect type, object lifecycle, terminality, and failure / anti-pattern type, then validate that broader schema beyond TAU-2 before changing the default.

What Changed

- Refines the trajectory extraction prompt into a procedure-memory protocol: operation intent, preconditions, immutable-object boundary, procedure, write-field provenance, anti-patterns, applicability, and negative applicability.
- Uses coarse lifecycle labels as lifecycle boundaries rather than benchmark task categories; their job is to keep one memory from blending unrelated reads, writes, handoffs, and final responses.
- Allows a session to produce multiple reusable trajectory memories when it contains separate intents or lifecycle transitions, instead of forcing one whole-session memory.
- Keeps the memory generally reusable: concrete IDs, payment details, exact dates, and case-specific values should become evidence only when needed, not the trigger for future recall.
- Adds TAU-2 runner support for:
  - trajectory vs experience retrieval selection;
  - first-user, pre-write, and combined retrieval nodes;
  - scope prompt files;
  - reusable corpus preparation with corpus identity checks;
  - fixed-count injection and optional injected-memory character budgets;
  - safe corpus reuse across experience/trajectory buckets by rebuilding the search URI from the current strategy memory type.
- Adds benchmark/tau2/config/prb_content_matrix_new_prompt.yaml so reviewers can reproduce the no-memory control, trajectory top4, experience top2, and 4000-character budget ablation from one config.
- Removes diagnostic controls from the PR-B benchmark path; selector, controller, category rerank, and failure-reflection variants are archived for separate experiments.
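
To make the memory shape concrete, here is a minimal sketch of one procedure memory as a Python dataclass. The class name, field types, the example values, and `retrieval_text` are all hypothetical; only the field names mirror the protocol items listed above, and the actual extraction output format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ProcedureMemory:
    """Hypothetical container for one extracted trajectory memory."""

    operation_intent: str
    lifecycle_label: str  # coarse lifecycle boundary; the values used here are illustrative
    preconditions: list[str] = field(default_factory=list)
    immutable_object_boundary: list[str] = field(default_factory=list)
    procedure: list[str] = field(default_factory=list)
    write_field_provenance: dict[str, str] = field(default_factory=dict)
    anti_patterns: list[str] = field(default_factory=list)
    applicability: list[str] = field(default_factory=list)
    negative_applicability: list[str] = field(default_factory=list)

    def retrieval_text(self) -> str:
        """Recall trigger built from intent and applicability only, not case-specific values."""
        return " | ".join([self.operation_intent, *self.applicability])


# Made-up example content, for illustration only.
memory = ProcedureMemory(
    operation_intent="Change the shipping address on an order that has not shipped",
    lifecycle_label="write",
    preconditions=["User identity verified", "Order status allows modification"],
    procedure=["Look up the order", "Confirm the new address with the user", "Apply the update"],
    write_field_provenance={"address": "stated by the user in this conversation"},
    anti_patterns=["Updating before the user explicitly confirms"],
    applicability=["User asks to change where an unshipped order is delivered"],
    negative_applicability=["Order already shipped or delivered"],
)
print(memory.retrieval_text())
```

A session with several intents would then yield a list of such objects, one per intent or lifecycle transition, rather than a single whole-session record.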

Evidence

All numbers below are TAU-2 held-out test reward scores: 8 repeats, retail + airline, fixed first user where applicable, reasoning-high. DB-match scoreboards are preserved in the run artifacts, but reward is the primary metric.

Main no-category trajectory routes

| Route | Retail | Airline | Domain avg | Delta vs no-memory | Notes |
| --- | --- | --- | --- | --- | --- |
| No-memory reference | 0.84688 | 0.75625 | 0.80156 | - | Baseline for this PR description. |
| Trajectory top4, first-user + scope | 0.88750 | 0.78125 | 0.83438 | +0.03281 | Strong retail lift and positive airline lift. |
| Trajectory top4, pre-write-only + scope | 0.87500 | 0.78125 | 0.82813 | +0.02656 | Positive two-domain average. |
| Trajectory top4, first-user + pre-write + scope, headline recheck | 0.89375 | 0.81250 | 0.85313 | +0.05157 | Current strongest PR-B evidence. |

Content-shape ablation

| Shape | Best route | Retail | Airline | Domain avg | Read |
| --- | --- | --- | --- | --- | --- |
| Trajectory fixed top4 | first-user + pre-write | 0.88750 | 0.80625 | 0.84688 | Headline candidate. |
| Experience fixed top2 | first-user + pre-write | 0.88438 | 0.76250 | 0.82344 | Retail helps; airline is near baseline. |
| Trajectory 4000-char | pre-write-only | 0.85938 | 0.77500 | 0.81719 | Budget usually injects only one trajectory. |
| Experience 4000-char | first-user + pre-write | 0.85625 | 0.76875 | 0.81250 | Weaker than fixed-count; avg injected is about 1.3-2.0 memories depending on node/domain. |

The content-shape matrix completed with 144/144 valid cells and 0 nonzero return codes; the PR config now reproduces that same 9-strategy evidence grid. The result supports fixed-count procedure-shaped trajectory memory as the PR-B default; experience and 4000-character budget variants remain useful ablations, but not the headline route.
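
The gap between fixed-count and character-budget injection is easier to see in code. This is a sketch under stated assumptions, not the runner's implementation: both function names are hypothetical, the memory lengths are made up, and the budget policy shown (stop at the first memory that would overflow) is one plausible reading.

```python
def select_fixed_count(ranked: list[str], k: int) -> list[str]:
    """Inject the top-k retrieved memories regardless of their length."""
    return ranked[:k]


def select_by_char_budget(ranked: list[str], max_chars: int) -> list[str]:
    """Inject memories in rank order until the next one would exceed the budget."""
    chosen: list[str] = []
    used = 0
    for text in ranked:
        if used + len(text) > max_chars:
            break
        chosen.append(text)
        used += len(text)
    return chosen


# Made-up memory lengths: long procedure-shaped trajectories fill a 4000-char budget quickly.
ranked = ["T" * 2600, "T" * 2400, "T" * 1800, "T" * 900]

print(len(select_fixed_count(ranked, 4)))        # 4
print(len(select_by_char_budget(ranked, 4000)))  # 1
```

With trajectories of this size a 4000-character budget admits only the first hit, which is in line with the table's note that the budget usually injects a single trajectory.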

Legacy/default prompt control

This control keeps the same success-only and TAU-2 replay setup, but uses the older trajectory prompt shape instead of the procedure/action-boundary protocol. It is included to separate the prompt/protocol contribution from corpus hygiene and budget effects.

The no-memory rows come from each matrix's own paired control, so the most useful read is both the absolute domain average and the delta against that matrix's no-memory baseline.

| Shape | Route | Legacy/default prompt avg | Legacy delta | Procedure-shaped prompt avg | Procedure delta | New - legacy | Read |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No-memory | - | 0.79375 | - | 0.80156 | - | +0.00781 | Separate paired controls; not the claimed prompt effect. |
| Trajectory fixed top4 | first-user | 0.79688 | +0.00313 | 0.83437 | +0.03281 | +0.03749 | Procedure shape makes first-turn recall useful. |
| Trajectory fixed top4 | pre-write-only | 0.80781 | +0.01406 | 0.82812 | +0.02656 | +0.02031 | Both positive; procedure shape is stronger. |
| Trajectory fixed top4 | first-user + pre-write | 0.79531 | +0.00156 | 0.84688 | +0.04531 | +0.05157 | Headline PR-B route. |
| Experience fixed top2 | first-user | 0.80312 | +0.00938 | 0.81406 | +0.01250 | +0.01094 | Small positive control; not headline. |
| Experience fixed top2 | pre-write-only | 0.79844 | +0.00469 | 0.82031 | +0.01875 | +0.02187 | Procedure branch is stronger. |
| Experience fixed top2 | first-user + pre-write | 0.78125 | -0.01250 | 0.82344 | +0.02188 | +0.04219 | Legacy combo hurts; procedure combo helps. |
| Trajectory 4000-char | first-user | 0.79375 | +0.00000 | 0.81563 | +0.01406 | +0.02188 | Budgeted first-user remains weaker than fixed top4. |
| Trajectory 4000-char | pre-write-only | 0.82344 | +0.02969 | 0.81719 | +0.01562 | -0.00625 | Best legacy route, but still below procedure fixed top4 combo. |
| Trajectory 4000-char | first-user + pre-write | 0.81094 | +0.01719 | 0.82031 | +0.01875 | +0.00937 | Both positive but below headline. |
| Experience 4000-char | first-user | 0.77812 | -0.01562 | 0.80000 | -0.00156 | +0.02188 | Neither should be headline. |
| Experience 4000-char | pre-write-only | 0.81250 | +0.01875 | 0.80625 | +0.00469 | -0.00625 | Legacy budget has a small edge here, but route is not best overall. |
| Experience 4000-char | first-user + pre-write | 0.77969 | -0.01406 | 0.81250 | +0.01094 | +0.03281 | Procedure branch avoids the legacy combo regression. |

The legacy/default prompt control completed with 208/208 valid cells. The complete comparison is intentionally not one-sided: legacy 4000-character pre-write has a couple of small wins, but the fixed-count trajectory routes, especially first-user + pre-write, are much stronger with the procedure/action-boundary extraction protocol. This supports keeping the extraction-protocol change as the center of the PR rather than treating the result as a corpus-only or budget-only effect.

Success + failure outcome-label ablation

This is a risk check for reviewers who worry that the new extraction prompt might only be safe on successful trajectories. I rebuilt success+failure corpora where failed training sessions carry only a minimal outcome label, not TAU-2 reward assertions or evaluator feedback. The result does not beat the success-only headline, so this PR keeps the default corpus success-only and leaves failure memories for a separate anti-pattern / root-cause compression design.
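
The label-only corpus construction can be sketched as follows. The session keys (`messages`, `succeeded`, `reward_info`, `evaluator_feedback`) and both helper names are hypothetical stand-ins; the point is only that a failed session contributes its conversation plus a minimal outcome label and nothing derived from the evaluator.

```python
def to_corpus_record(session: dict) -> dict:
    """Keep the conversation; a failed session gets only a minimal outcome label."""
    record = {"messages": session["messages"]}
    if not session["succeeded"]:
        # Deliberately drop reward assertions and evaluator feedback.
        record["outcome"] = "failed"
    return record


def build_corpus(sessions: list[dict], include_failures: bool) -> list[dict]:
    """Success-only by default; optionally add label-only failure records."""
    return [
        to_corpus_record(s)
        for s in sessions
        if s["succeeded"] or include_failures
    ]


# Made-up training sessions for illustration.
sessions = [
    {"messages": ["..."], "succeeded": True, "reward_info": {"reward": 1.0}},
    {"messages": ["..."], "succeeded": False, "reward_info": {"reward": 0.0},
     "evaluator_feedback": "db mismatch"},
]

success_only = build_corpus(sessions, include_failures=False)
success_plus_failure = build_corpus(sessions, include_failures=True)
print(len(success_only), len(success_plus_failure))  # 1 2
```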

| Prompt / corpus | Route | Retail | Airline | Domain avg | Read |
| --- | --- | --- | --- | --- | --- |
| Optimized trajectory prompt, success-only | trajectory top4, first-user + pre-write | 0.89375 | 0.81250 | 0.85313 | PR-B headline; fixed protocol, 16/16 valid cells. |
| Optimized trajectory prompt, success+failure label-only | trajectory top4, first-user + pre-write | 0.85000 | 0.76875 | 0.80938 | Fixed risk check, 32/32 valid cells; below success-only. |
| Optimized trajectory prompt, success+failure label-only | experience top2, first-user | 0.84688 | 0.71875 | 0.78281 | Fixed risk check; experience-only also does not recover the gap. |
| Legacy/default prompt, success+failure label-only | trajectory top4, first-user + pre-write | 0.81875 | 0.72500 | 0.77188 | Diagnostic old-prompt control; lower than the procedure-shaped prompt. |
| Legacy/default prompt, success+failure label-only | experience top2, first-user | 0.81250 | 0.70625 | 0.75938 | Diagnostic old-prompt control; kept as a failure-corpus sanity check. |

Reproduce

After TAU-2, model credentials, and a clean local OpenViking service are configured, run a tiny wiring smoke:

benchmark/tau2/run_full_eval.sh \
  --config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
  --domain retail \
  --strategy-id new_traj_fixed_first_user_prewrite \
  --num-tasks 1 \
  --train-num-tasks 1 \
  --repeat-count 1 \
  --strict-preflight \
  --execute

Run the full PR-B evidence matrix:

benchmark/tau2/run_full_eval.sh \
  --config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
  --run-id prb_content_matrix_new_prompt_full8 \
  --strict-preflight \
  --execute

The main aggregate is benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json; per-cell details are in cell_results/, and corpus identity / generated memory checks are in memory_corpora/.
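
For scripted runs, the two invocations above can be assembled programmatically. The sketch below only builds and prints the argument list, using exactly the flags shown in the commands; `build_argv` is a hypothetical helper and nothing here is part of the benchmark tooling.

```python
import shlex

RUNNER = "benchmark/tau2/run_full_eval.sh"
CONFIG = "benchmark/tau2/config/prb_content_matrix_new_prompt.yaml"


def build_argv(smoke: bool) -> list[str]:
    """Return the argument list for the smoke run or the full evidence matrix."""
    argv = [RUNNER, "--config", CONFIG]
    if smoke:
        # Tiny wiring check: one retail task, one training task, one repeat.
        argv += [
            "--domain", "retail",
            "--strategy-id", "new_traj_fixed_first_user_prewrite",
            "--num-tasks", "1",
            "--train-num-tasks", "1",
            "--repeat-count", "1",
        ]
    else:
        argv += ["--run-id", "prb_content_matrix_new_prompt_full8"]
    return argv + ["--strict-preflight", "--execute"]


print(shlex.join(build_argv(smoke=True)))
print(shlex.join(build_argv(smoke=False)))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from the repository root once the prerequisites listed above are in place.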

Validation

- git diff --check
- sensitive-token coarse scan over the staged diff
- /Users/bytedance/Documents/OpenViking-pr-b-trajectory/.venv/bin/python -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py openviking/session/memory/agent_trajectory_context_provider.py
- Plan-only check for benchmark/tau2/config/prb_content_matrix_new_prompt.yaml: 144 cells planned, 144 executable, char-budget flags emitted for the 4000-character variants.
- Fresh airline no-memory full8 baseline: reward 0.75625.
- PR-B content-shape matrix: 144/144 cells valid, all returncode 0.
- Legacy/default prompt strict control: 208/208 cells valid; best route 0.82344 domain-avg reward, below the procedure-shaped headline route at 0.84688 in the content-shape matrix.
- Latest headline-route recheck: 16/16 cells valid; retail 0.89375, airline 0.81250, domain-avg reward 0.85313.
- Success+failure outcome-label-only risk checks: see the ablation table above; the fixed strongest-route check completed 32/32 cells valid and stayed below the success-only headline.

Thanks @yangxinxin-7 for the TAU-2 benchmark scaffold in #2003; this PR keeps using that workflow rather than adding a separate eval path.

github-actions · 2026-05-13T09:58:00Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅
2003 - Fully compliant
Compliant requirements:
- Add TAU-2 trajectory config
- Add search_memory_type parameter and validation
- Refine trajectory extraction instruction and template
- Update prompts to use neutral wording
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-05-13T09:59:42Z

PR Code Suggestions ✨

No code suggestions found for the PR.

huangruiteng · 2026-05-21T10:31:06Z

Follow-up PR for the scope-prompt fairness read: #2172.

Why this exists: the strongest PR-B result used the TAU-2 domain scope prompt, and a same-scope no-memory rerun showed that the scope prompt itself can move the baseline. #2172 adds the missing no-memory scope wiring plus a generic advisory-memory scope prompt/config so the PR-B uplift can be read in two clean ways:

- domain-specific scope: headline avg 0.85313 vs same-scope no-memory 0.81719, conservative delta +0.03594
- generic scope: trajectory memory avg 0.84219 vs generic-scope no-memory 0.79844, delta +0.04375

This does not change the main PR-B mechanism; it makes the attribution boundary clearer and gives us a benchmark-neutral scope option.

huangruiteng · 2026-05-26T20:19:29Z

Follow-up after the PR-B merge: opened #2255 to make trajectory retrieval use a compact trajectory_name + retrieval_anchor embedding surface.

Clean TAU-2 full8 read, same fixed-first-user + generic scope + trajectory top4 + pre-write top2 protocol:

| Setting | Retail | Airline | Task-weighted reward |
| --- | --- | --- | --- |
| current master trajectory prompt | 0.83750 | 0.74375 | 0.80625 |
| pre-#2221 trajectory prompt | 0.86250 | 0.75625 | 0.82708 |
| name + retrieval anchor (#2255) | 0.88438 | 0.78750 | 0.85208 |

This keeps the change narrow: no category rerank, controller, failure overlay, or runner filter; just a better positive applicability text for trajectory memory vector search.
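
Note that this table reports a task-weighted reward rather than the unweighted domain average used in the PR description. The reported values are consistent, to within rounding, with retail carrying twice the weight of airline. The comment does not state task counts, so the 2:1 weights below are inferred from the numbers and `task_weighted` is an illustrative helper.

```python
def task_weighted(retail: float, airline: float,
                  retail_weight: int = 2, airline_weight: int = 1) -> float:
    """Weighted mean of the two domain rewards; the 2:1 default is an inferred assumption."""
    total = retail_weight + airline_weight
    return (retail * retail_weight + airline * airline_weight) / total


# (retail, airline) rewards copied from the table above.
rows = {
    "current master trajectory prompt": (0.83750, 0.74375),
    "pre-#2221 trajectory prompt": (0.86250, 0.75625),
    "name + retrieval anchor (#2255)": (0.88438, 0.78750),
}

for name, (retail, airline) in rows.items():
    print(f"{name}: {task_weighted(retail, airline):.5f}")
```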

feat(benchmark): add TAU-2 trajectory memory treatment

d1caa77

github-project-automation Bot added this to OpenViking project May 13, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 13, 2026

huangruiteng added 5 commits May 13, 2026 18:31

style(benchmark): format tau2 trajectory scripts

a68d5e7

Merge remote-tracking branch 'origin/main' into feat/tau2-trajectory-…

0700781

…memory

refine trajectory memory view prompt

fddd7ba

feat(benchmark): prepare tau2 memory corpora before eval

0496000

fix(benchmark): tighten trajectory evidence prompt

08d33a9

huangruiteng force-pushed the feat/tau2-trajectory-memory branch from 9cfe362 to c2228a2 Compare May 13, 2026 16:12

fix(benchmark): guard tau2 infrastructure failures

2b767f2

huangruiteng force-pushed the feat/tau2-trajectory-memory branch from c2228a2 to 2b767f2 Compare May 13, 2026 17:03

huangruiteng added 7 commits May 14, 2026 01:54

fix(benchmark): resolve tau2 runner paths

63a1004

fix(memory): add trajectory evidence examples

fb62c46

fix(benchmark): run no-memory tau2 eval in process

e30b79b

bench(tau2): align retrieval budget and fixed first user

662cf0b

bench(tau2): reuse memory corpora across eval runs

f85d60b

bench(tau2): add scoped trajectory eval concurrency

d833980

style(benchmark): format tau2 eval runner

74c18db

huangruiteng mentioned this pull request May 14, 2026

bench(tau2): add category-aware memory rerank treatment #2044

Closed

style(benchmark): satisfy tau2 eval lint

8cf7737

MaojiaSheng approved these changes May 15, 2026

huangruiteng mentioned this pull request May 15, 2026

bench(tau2): add category-aware trajectory memory rerank #2079

Draft

4 tasks

huangruiteng added 3 commits May 20, 2026 02:00

bench(tau2): harden trajectory memory eval variants

d8297c6

fix(tau2): rebuild search URI for reused corpora

5a4f679

docs(tau2): add PR-B reproduction commands

f135e7f

huangruiteng changed the title ~~feat(memory): add TAU-2 trajectory-view treatment~~ feat(memory): extract TAU-2 trajectories into procedure memories May 20, 2026

huangruiteng changed the title ~~feat(memory): extract TAU-2 trajectories into procedure memories~~ feat(memory): upgrade trajectory extraction to beat TAU-2 no-memory May 20, 2026

huangruiteng changed the title ~~feat(memory): upgrade trajectory extraction to beat TAU-2 no-memory~~ feat(memory): upgrade trajectory extraction to beat no-memory baseline May 20, 2026

huangruiteng added 4 commits May 20, 2026 10:27

chore(tau2): keep PR-B benchmark scope focused

70031a4

chore(tau2): keep trajectory prompt generic

b41cf4d

chore(tau2): format benchmark scripts

1cdfead

bench(tau2): restore operation-family trajectory protocol

d65a2f9

yangxinxin-7 merged commit 520713c into volcengine:main May 21, 2026
5 checks passed

github-project-automation Bot moved this from Backlog to Done in OpenViking project May 21, 2026

huangruiteng mentioned this pull request May 21, 2026

bench(tau2): add generic scope fairness check #2172

Merged

yangxinxin-7 merged 22 commits into volcengine:main from huangruiteng:feat/tau2-trajectory-memory

3 participants
