feat(llm,timeline): cross-turn reasoning state replay for gpt-5/o3/o4 by ngoclam9415 · Pull Request #10 · aitomatic/dana-runtime

ngoclam9415 · 2026-05-10T14:21:41Z

Summary

Persist raw reasoning items from the OpenAI Responses API in TimelineEntry.metadata. On the next turn, splice them back into input[] when provider:model:endpoint matches, so gpt-5+ retains structured reasoning state across turns instead of re-deriving it from flattened assistant text.

Stacks on #9 — diff includes its 5 commits (Responses API plumbing, env-driven effort, agent reasoning persistence). When #9 merges, rebase this PR.

What changes

| Layer | Change |
| --- | --- |
| `LLMResponse` | + `reasoning_items: list[dict] \| None` and `response_id` |
| `LLMMessage` | + carriers `reasoning_items` / `reasoning_fingerprint` / `response_id` |
| `OpenAICompatibleProvider` | Captures raw items via input-side-only field whitelist; opts in to `include=["reasoning.encrypted_content"]` with sticky 400 fallback; `endpoint_hash` + `fingerprint` properties |
| `_chat_via_responses` | Stale-id / item-shape rejection fallback (one-shot strip-and-retry); never silently kills turns |
| `_convert_to_responses_input` | Splices raw items before assistant message when fingerprint matches; `LLM_REASONING_REPLAY=0` kill switch (default ON) |
| `Timeline.to_llm_messages` | Propagates entry metadata onto assistant messages; merge-skip when items present (preserves provenance) |
| `STARAgent._record_think_results` | Writes `{reasoning_items, fingerprint, response_id}` into `AGENT_THOUGHTS.metadata` |
| `NativeMessage.to_llm_message` | Propagates metadata fields; closes the CompressedTimeline gap |
| `compression_engine` | Summary metadata composed explicitly so reasoning state never leaks past the boundary |
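
The splice step listed for `_convert_to_responses_input` can be pictured with a minimal sketch. This is not the code in the PR: the function name `splice_reasoning_items`, the exact carrier key names, and the message dict shape are assumptions made for illustration. Only the behaviour (splice before the assistant message on a fingerprint match, always strip carriers, honour the kill switch) comes from the description.

```python
from typing import Any

# Hypothetical carrier keys; the PR only says they are underscore-prefixed.
CARRIER_KEYS = ("_reasoning_items", "_reasoning_fingerprint", "_response_id")


def splice_reasoning_items(
    messages: list[dict[str, Any]],
    active_fingerprint: str,
    replay_enabled: bool = True,
) -> list[dict[str, Any]]:
    """Build an input list with reasoning items placed before their assistant message."""
    result: list[dict[str, Any]] = []
    for message in messages:
        items = message.get("_reasoning_items") or []
        fingerprint = message.get("_reasoning_fingerprint")
        # Replay only when enabled and the items came from the same provider/model/endpoint.
        if replay_enabled and items and fingerprint == active_fingerprint:
            result.extend(items)
        # Carriers are stripped whether or not the splice fired.
        result.append({k: v for k, v in message.items() if k not in CARRIER_KEYS})
    return result
```

With a mismatched fingerprint or `replay_enabled=False`, the output is the original history minus the carrier keys, which corresponds to the flat-text path.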

Architecture

Capture (Phase 1):
  Responses API output[] → reasoning_items list[dict] (whitelisted fields)
                        → encrypted_content opportunistic when ZDR-enabled

Persist (Phase 2):
  AGENT_THOUGHTS.metadata = {reasoning_items, fingerprint, response_id}
  fingerprint = "{provider}:{model_family}:{sha256(endpoint_url)[:8]}"

Replay (Phase 3):
  Timeline → LLMMessage carriers → openai_messages dict carrier keys
          → splice on fingerprint match → Responses API input[]
  Order: [reasoning items] → [assistant w/ tool_calls] → [function_call_output]

Compression (Phase 4):
  Compressed-away entries discarded entirely; summary metadata composed
  fresh — reasoning items never survive the boundary.
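
The fingerprint format in Phase 2 can be sketched as follows. The function names mirror the `endpoint_hash` and `fingerprint` properties mentioned in the PR, but writing them as free functions, encoding the URL as UTF-8, using the hex digest, and skipping URL normalization are all assumptions. How `model_family` is derived from a model name such as `gpt-5.2` is not shown in the PR, so it is passed in directly here.

```python
import hashlib


def endpoint_hash(endpoint_url: str) -> str:
    """First 8 hex characters of the SHA-256 digest of the endpoint URL."""
    return hashlib.sha256(endpoint_url.encode("utf-8")).hexdigest()[:8]


def fingerprint(provider: str, model_family: str, endpoint_url: str) -> str:
    """Compose "{provider}:{model_family}:{endpoint_hash}"."""
    return f"{provider}:{model_family}:{endpoint_hash(endpoint_url)}"
```

Because the endpoint hash is part of the fingerprint, items captured against one deployment are not replayed against another, even when provider and model family are the same.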

Counter-intuitive finding (live verify)

Replay does NOT save tokens in steady-state. When prior reasoning state is present, gpt-5 reasons every turn (~200-300 tokens per accumulated item). Without replay, the model nondeterministically skips reasoning, sometimes saving tokens but losing continuity.

Real benefit:

Reasoning continuity — model carries structured state, not lossy summary text
Reasoning consistency — every turn reasons when state is provided
Cost benefit materializes only with ZDR/encrypted_content (encrypted blob is smaller than equivalent summary text)

Records: plans/260507-1829-reasoning-state-replay/reports/

Critical bug found and fixed during verify

model_dump() was leaking output-only fields (status, content) into replayed input[]. Azure rejected with unknown_parameter: input[N].status, silently killing all subsequent turns. Now whitelisted to {type, id, summary, encrypted_content} + stale-id rejection fallback for graceful degradation.
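
A minimal sketch of the two fixes, under stated assumptions. The whitelist fields are the ones named above; dropping `None` values, the helper names, and the `RejectedInputError` stand-in for the provider's bad-request error are hypothetical. The PR's rejection heuristic is not described, so the sketch retries on any such error when there is something to strip.

```python
from typing import Any, Callable

# Input-side fields named in the PR description.
INPUT_SIDE_FIELDS = ("type", "id", "summary", "encrypted_content")


class RejectedInputError(Exception):
    """Hypothetical stand-in for the provider's bad-request error."""


def serialize_reasoning_item(raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only whitelisted fields; output-only ones such as status are dropped."""
    return {k: raw[k] for k in INPUT_SIDE_FIELDS if raw.get(k) is not None}


def call_with_strip_and_retry(
    send: Callable[[list[dict[str, Any]]], Any],
    input_items: list[dict[str, Any]],
) -> Any:
    """Send input_items; on rejection, strip reasoning items and retry exactly once."""
    try:
        return send(input_items)
    except RejectedInputError:
        stripped = [i for i in input_items if i.get("type") != "reasoning"]
        if len(stripped) == len(input_items):
            raise  # nothing to strip, so the rejection has another cause
        # Single retry; a second failure propagates to the caller.
        return send(stripped)
```

The retry degrades the turn to flat-text history rather than failing it, which matches the graceful-degradation goal stated above.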

Operator runbook

# Disable replay globally (fall back to flat-text history)
LLM_REASONING_REPLAY=0

# Verify a session has captured items
cat .dana/dana_agent/<id>/sessions/<sid>/timeline.json \
  | jq '.entries[] | select(.type=="agent_thoughts") | .metadata.fingerprint'

# Expected log line on replay
# reasoning replay items=N fingerprint=azure:gpt-5:abcd1234

Resume from disk (one ergonomics gotcha): DanaCodingAgent doesn't auto-resume sessions. To continue a persisted conversation in a fresh process:

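import json

# Import DanaCodingAgent from the dana-runtime package first (module path not shown in this PR).
agent = DanaCodingAgent(agent_id="my-agent", llm_provider="azure", model="gpt-5.2")
# Replace <sid> with the id of the session to resume.
with open(".dana/dana_agent/my-agent/sessions/<sid>/timeline.json") as f:
    data = json.load(f)
agent._timeline.load_from_entries(entries=data["entries"])
# Replay fires automatically on subsequent calls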

Tests

333/333 unit tests green:

tests/unit/llm/test_reasoning_items_capture.py (14) — capture path, encrypted_content presence/absence, include-flag fallback, endpoint hash, fingerprint format
tests/unit/core/agent/test_thinking_metadata_persistence.py (11) — _build_thinking_metadata, _provider_fingerprint defensive lookup, both _record_think_results branches, JSON round-trip, isolation
tests/unit/llm/test_reasoning_replay.py (17) — splice match/mismatch, kill switch (incl. falsy values), tool-call ordering, multi-turn, carrier stripping, Timeline.to_llm_messages propagation, merge-skip
tests/unit/test_reasoning_compression_policy.py (7) — NativeMessage propagation, CompressedTimeline end-to-end, compressed-away entries vanish, kept entries retain items, no-keep edge case
tests/unit/test_reasoning_replay_disk_resume.py (4) — legacy + native format disk-resume, cross-provider mismatch, kill switch on resumed timeline

Live verify (Azure gpt-5.2):

scripts/verify-reasoning-replay.py — 4-turn × 2-run (ON/OFF), splice fires confirmed, kill switch confirmed
scripts/verify-disk-resume-replay.py — fresh process loads 3-item timeline, replays all 3 to input[]

Test plan

Local: uv run pytest tests/unit -q → all green
Live in-process: uv run python scripts/verify-reasoning-replay.py → 4 hard PASS
Live disk-resume: uv run python scripts/verify-disk-resume-replay.py → 3 hard PASS
Smoke gpt-5.2 long session in dev env, then flip LLM_REASONING_REPLAY=0 and confirm flat-text path still works

Worth flagging

Token cost grows ~200-300 per accumulated item with replay. Net benefit is reasoning continuity, not cost. Flip to ZDR for cost win.
DanaCodingAgent doesn't auto-resume from disk. Pre-existing behavior, not introduced here, but consumers of replay will hit it. Future ergonomics PR could add resume_from_session_id.
Long sessions grow timeline.json since reasoning items persist alongside summary text. Compression drops them from the summary boundary, but steady-state size still increases. Monitor .dana/**/timeline.json size in prod.
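
One way to keep an eye on timeline growth is a small report script. This is a sketch, not part of the PR: the function name is hypothetical, and it assumes the `entries[].metadata.reasoning_items` layout implied by the jq command in the runbook. Files persisted in another layout would need a different accessor.

```python
import json
from pathlib import Path


def summarize_timelines(root: Path) -> list[tuple[str, int, int]]:
    """Return (path, size_in_bytes, reasoning_item_count) per timeline.json, largest first."""
    rows = []
    for path in root.rglob("timeline.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Count persisted reasoning items across all entries; entries without metadata count as 0.
        count = sum(
            len((entry.get("metadata") or {}).get("reasoning_items") or [])
            for entry in data.get("entries", [])
        )
        rows.append((str(path), path.stat().st_size, count))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `summarize_timelines(Path(".dana"))` from the project root and printing the first few rows gives the largest sessions together with how many reasoning items each one carries.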

…es 1-4)

Persist raw reasoning items from the OpenAI Responses API in TimelineEntry.metadata; on subsequent calls splice them back into input[] when provider:model:endpoint matches, so gpt-5+ retains structured reasoning state across turns instead of re-deriving it from flattened assistant text. Reduces token churn and improves reasoning continuity. Native-tool codec only.

Provider capture (Phase 1):
- LLMResponse.reasoning_items + response_id (raw output[] items)
- _chat_via_responses opts in to include=reasoning.encrypted_content with sticky one-shot 400 fallback for accounts/api-versions that reject the flag
- endpoint_hash + fingerprint properties on Azure / OpenAI providers (sha256 of base_url, first 8 chars)

Agent persistence (Phase 2):
- ParsedResponse carries reasoning_items + response_id; both codecs populate them
- star_agent _record_think_results writes {reasoning_items, fingerprint, response_id} into AGENT_THOUGHTS metadata on the reasoning entry only (not the response/tool_call entries)

Codec replay (Phase 3):
- LLMMessage gains reasoning_items / reasoning_fingerprint / response_id carriers
- Timeline.to_llm_messages propagates entry metadata onto assistant messages; merge step skips when items present so 1:1 fingerprint provenance is preserved
- prepare_messages threads carriers via underscore-prefixed keys
- _convert_to_responses_input splices raw items before assistant message when fingerprint matches the active provider; carriers always stripped before API call
- LLM_REASONING_REPLAY=0 kill switch (default ON, env-resolved per call)

Compression policy (Phase 4):
- NativeMessage.to_llm_message propagates reasoning fields - closes a Phase 3 gap where CompressedTimeline-routed traffic silently dropped them
- compression_engine.compress composes summary metadata from a single explicit dict so reasoning items never survive the boundary

Tests: 49 new unit tests across capture, agent persistence, codec replay (match/mismatch/kill switch/tool-call ordering/multi-turn), and compression policy. 422/422 across LLM + core + compression suites green.

…ale-id fallback

Surfaced by Phase 5 live verify (Azure gpt-5.2): replaying reasoning items captured via model_dump() caused HTTP 400 because output-side fields like 'status' aren't valid on input. Without a fallback, the rejection silently killed every subsequent turn (user_message saved, no agent_response).

Two fixes:
1. _serialize_output_item now whitelists input-side fields only: type, id, summary, encrypted_content. No more model_dump() leak of output-only metadata into replayed input[].
2. _chat_via_responses gains a one-shot stale-id / item-shape rejection fallback. On BadRequestError matching the rejection heuristic, strip reasoning items from input[] and retry once. Degrades to flat-text replay rather than failing the user's turn entirely.

Verify script (scripts/verify-reasoning-replay.py) confirms:
- replay fires when ON (3 fires across turns 2-4)
- kill switch (LLM_REASONING_REPLAY=0) cleanly disables replay
- reasoning_items accumulate cumulatively turn-over-turn
- all 4 turns complete cleanly with replay enabled
- 329/329 unit tests still green

Findings recorded in plans/.../verify-260507-1905-reasoning-replay.md: replay does NOT save tokens in steady-state - model reasons more when prior state is present (~200-300 tokens per accumulated item). Real benefit is reasoning continuity, not cost. Cost benefit materializes when paired with ZDR/encrypted_content (smaller payload).

User flagged the gap: Phase 5 verified in-process multi-turn replay, but did not exercise the fresh-process disk-load → resume → replay path. This commit closes it with both unit tests and a live verify.

Findings:
1. The replay machinery works end-to-end across a process boundary. Persisted timeline.json round-trips through the legacy + native load formats, metadata survives, items splice into Responses API input[] when fingerprint matches.
2. DanaCodingAgent does NOT auto-resume from disk - each new process gets a fresh empty session. Resume requires explicit re-injection of persisted entries via agent._timeline.load_from_entries(...). Documented in the verify report.

Tests (4):
- Legacy-format disk resume → splice fires with persisted items
- Native-format disk resume → NativeMessage.from_dict preserves metadata.reasoning_items end-to-end
- Cross-provider fingerprint mismatch → no replay (carriers stripped)
- Kill switch overrides matching fingerprint after resume

Live verify (scripts/verify-disk-resume-replay.py): fresh process loaded a 3-item timeline, replayed all 3 into input[] on the next turn (1 splice fire), LLM call completed. 333/333 unit tests pass.

ngoclam9415 added 3 commits May 10, 2026 21:24

ngoclam9415 force-pushed the feat/reasoning-state-replay branch from 2f11174 to b9a9a79 Compare May 10, 2026 14:24

ngoclam9415 merged commit 799959f into develop May 10, 2026
1 check failed

ngoclam9415 mentioned this pull request May 18, 2026

fix(timeline,tools): GPT-5 timeline robustness — reasoning-item drop + tool-batch isolation #11

Merged

TheVinhLuong102 deleted the feat/reasoning-state-replay branch May 21, 2026 22:02
