feat(llm,timeline): cross-turn reasoning state replay for gpt-5/o3/o4 #10

Merged
ngoclam9415 merged 3 commits into develop from feat/reasoning-state-replay
May 10, 2026

@ngoclam9415
Contributor

Summary

Persist raw reasoning items from the OpenAI Responses API in TimelineEntry.metadata. On the next turn, splice them back into input[] when provider:model:endpoint matches, so gpt-5+ retains structured reasoning state across turns instead of re-deriving it from flattened assistant text.

Stacks on #9 — diff includes its 5 commits (Responses API plumbing, env-driven effort, agent reasoning persistence). When #9 merges, rebase this PR.

What changes

Layer                            Change
LLMResponse                      + reasoning_items: list[dict] | None and response_id
LLMMessage                       + carriers reasoning_items / reasoning_fingerprint / response_id
OpenAICompatibleProvider         Captures raw items via input-side-only field whitelist; opts in to include=["reasoning.encrypted_content"] with sticky 400 fallback; endpoint_hash + fingerprint properties
_chat_via_responses              Stale-id / item-shape rejection fallback (one-shot strip-and-retry); never silently kills turns
_convert_to_responses_input      Splices raw items before assistant message when fingerprint matches; LLM_REASONING_REPLAY=0 kill switch (default ON)
Timeline.to_llm_messages         Propagates entry metadata onto assistant messages; merge-skip when items present (preserves provenance)
STARAgent._record_think_results  Writes {reasoning_items, fingerprint, response_id} into AGENT_THOUGHTS.metadata
NativeMessage.to_llm_message     Propagates metadata fields; closes the CompressedTimeline gap
compression_engine               Summary metadata composed explicitly so reasoning state never leaks past the boundary
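
To make the splice rule concrete, here is a minimal sketch of what the _convert_to_responses_input row describes. It is not the PR's code: the function name splice_reasoning and the carrier key names (_reasoning_items, _reasoning_fingerprint, _response_id) are assumptions, since the PR only says the carriers travel under underscore-prefixed keys.

```python
from typing import Any

# Hypothetical carrier key names; the PR only states they are underscore-prefixed.
CARRIER_KEYS = ("_reasoning_items", "_reasoning_fingerprint", "_response_id")


def splice_reasoning(
    messages: list[dict[str, Any]],
    active_fingerprint: str,
    replay_enabled: bool = True,
) -> list[dict[str, Any]]:
    """Place raw reasoning items directly before the assistant message that produced them."""
    out: list[dict[str, Any]] = []
    for msg in messages:
        items = msg.get("_reasoning_items") or []
        matches = msg.get("_reasoning_fingerprint") == active_fingerprint
        if replay_enabled and msg.get("role") == "assistant" and items and matches:
            out.extend(items)  # reasoning items come first
        # Carriers are stripped whether or not replay fired, so they never reach the API.
        out.append({k: v for k, v in msg.items() if k not in CARRIER_KEYS})
    return out
```

On a fingerprint mismatch or with replay disabled, the assistant message passes through as plain text with its carriers removed, which corresponds to the flat-text history path.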

Architecture

Capture (Phase 1):
  Responses API output[] → reasoning_items list[dict] (whitelisted fields)
                        → encrypted_content opportunistic when ZDR-enabled

Persist (Phase 2):
  AGENT_THOUGHTS.metadata = {reasoning_items, fingerprint, response_id}
  fingerprint = "{provider}:{model_family}:{sha256(endpoint_url)[:8]}"

Replay (Phase 3):
  Timeline → LLMMessage carriers → openai_messages dict carrier keys
          → splice on fingerprint match → Responses API input[]
  Order: [reasoning items] → [assistant w/ tool_calls] → [function_call_output]

Compression (Phase 4):
  Compressed-away entries discarded entirely; summary metadata composed
  fresh — reasoning items never survive the boundary.
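
The fingerprint format from Phase 2 can be sketched as follows. The function names and signatures are illustrative, not the provider's actual properties, and whether the real code normalizes the URL (trailing slash, case) before hashing is not stated in this PR.

```python
import hashlib


def endpoint_hash(endpoint_url: str) -> str:
    """First 8 hex characters of the SHA-256 digest of the endpoint URL."""
    return hashlib.sha256(endpoint_url.encode("utf-8")).hexdigest()[:8]


def fingerprint(provider: str, model_family: str, endpoint_url: str) -> str:
    """Build "{provider}:{model_family}:{sha256(endpoint_url)[:8]}"."""
    return f"{provider}:{model_family}:{endpoint_hash(endpoint_url)}"
```

Because the hash covers the endpoint, two deployments of the same model behind different base URLs produce different fingerprints, so items captured on one are not replayed to the other.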

Counter-intuitive finding (live verify)

Replay does NOT save tokens in steady-state. When prior reasoning state is present, gpt-5 reasons every turn (~200-300 tokens per accumulated item). Without replay, the model nondeterministically skips reasoning, sometimes saving tokens but losing continuity.

Real benefit:

  • Reasoning continuity — model carries structured state, not lossy summary text
  • Reasoning consistency — every turn reasons when state is provided
  • Cost benefit materializes only with ZDR/encrypted_content (encrypted blob is smaller than equivalent summary text)

Records: plans/260507-1829-reasoning-state-replay/reports/

Critical bug found and fixed during verify

model_dump() was leaking output-only fields (status, content) into replayed input[]. Azure rejected with unknown_parameter: input[N].status, silently killing all subsequent turns. Now whitelisted to {type, id, summary, encrypted_content} + stale-id rejection fallback for graceful degradation.
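
A minimal sketch of the whitelist idea follows. It assumes the item has already been turned into a plain dict (for example via model_dump()), and that fields whose value is None are dropped; the PR does not show the real _serialize_output_item body, so both points are assumptions.

```python
from typing import Any

# Input-side fields named in the PR; everything else (status, content, ...) is output-only.
INPUT_SIDE_FIELDS = ("type", "id", "summary", "encrypted_content")


def serialize_output_item(item: dict[str, Any]) -> dict[str, Any]:
    """Keep only fields that are valid when the item is sent back in input[]."""
    return {key: item[key] for key in INPUT_SIDE_FIELDS if item.get(key) is not None}
```

A whitelist is the safer direction here: any new output-only field the API adds later is excluded by default, whereas a blacklist would leak it.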

Operator runbook

# Disable replay globally (fall back to flat-text history)
LLM_REASONING_REPLAY=0

# Verify a session has captured items
cat .dana/dana_agent/<id>/sessions/<sid>/timeline.json \
  | jq '.entries[] | select(.type=="agent_thoughts") | .metadata.fingerprint'

# Expected log line on replay
# reasoning replay items=N fingerprint=azure:gpt-5:abcd1234
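
For reference, a per-call kill-switch check could look like the sketch below. Only LLM_REASONING_REPLAY=0 and the default-ON behaviour come from this PR; the other falsy spellings and the function name are assumptions.

```python
import os
from typing import Mapping, Optional

# Assumed falsy spellings; the PR confirms only "0".
_FALSY = {"0", "false", "no", "off"}


def replay_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Resolve the kill switch on every call so a change takes effect without a restart."""
    env = os.environ if env is None else env
    raw = env.get("LLM_REASONING_REPLAY")
    if raw is None:
        return True  # default ON
    return raw.strip().lower() not in _FALSY
```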

Resume from disk (one ergonomics gotcha): DanaCodingAgent doesn't auto-resume sessions. To continue a persisted conversation in a fresh process:

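import json
from pathlib import Path

# DanaCodingAgent must be imported from the project; its module path is not shown in this PR.
agent = DanaCodingAgent(agent_id="my-agent", llm_provider="azure", model="gpt-5.2")
# Replace <sid> with the id of the persisted session.
timeline_path = Path(".dana/dana_agent/my-agent/sessions/<sid>/timeline.json")
data = json.loads(timeline_path.read_text(encoding="utf-8"))
agent._timeline.load_from_entries(entries=data["entries"])
# Replay fires automatically on subsequent calls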
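
Before re-injecting entries, it can help to confirm that the file actually carries reasoning state. The sketch below mirrors the jq filter from the runbook (entries with type "agent_thoughts", then metadata.fingerprint) and also counts metadata.reasoning_items; the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Optional, Union


def load_timeline(path: Union[str, Path]) -> dict[str, Any]:
    """Read a persisted timeline.json into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def captured_state(timeline: dict[str, Any]) -> list[tuple[Optional[str], int]]:
    """Return (fingerprint, reasoning item count) for each agent_thoughts entry."""
    rows: list[tuple[Optional[str], int]] = []
    for entry in timeline.get("entries", []):
        if entry.get("type") != "agent_thoughts":
            continue
        meta = entry.get("metadata") or {}
        rows.append((meta.get("fingerprint"), len(meta.get("reasoning_items") or [])))
    return rows
```

If every row shows a count of zero, or a fingerprint that differs from the active provider's, the resumed session will fall back to flat-text history.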

Tests

333/333 unit tests green:

  • tests/unit/llm/test_reasoning_items_capture.py (14) — capture path, encrypted_content presence/absence, include-flag fallback, endpoint hash, fingerprint format
  • tests/unit/core/agent/test_thinking_metadata_persistence.py (11) — _build_thinking_metadata, _provider_fingerprint defensive lookup, both _record_think_results branches, JSON round-trip, isolation
  • tests/unit/llm/test_reasoning_replay.py (17) — splice match/mismatch, kill switch (incl. falsy values), tool-call ordering, multi-turn, carrier stripping, Timeline.to_llm_messages propagation, merge-skip
  • tests/unit/test_reasoning_compression_policy.py (7) — NativeMessage propagation, CompressedTimeline end-to-end, compressed-away entries vanish, kept entries retain items, no-keep edge case
  • tests/unit/test_reasoning_replay_disk_resume.py (4) — legacy + native format disk-resume, cross-provider mismatch, kill switch on resumed timeline

Live verify (Azure gpt-5.2):

  • scripts/verify-reasoning-replay.py — 4-turn × 2-run (ON/OFF), splice fires confirmed, kill switch confirmed
  • scripts/verify-disk-resume-replay.py — fresh process loads 3-item timeline, replays all 3 to input[]

Test plan

  • Local: uv run pytest tests/unit -q → all green
  • Live in-process: uv run python scripts/verify-reasoning-replay.py → 4 hard PASS
  • Live disk-resume: uv run python scripts/verify-disk-resume-replay.py → 3 hard PASS
  • Smoke gpt-5.2 long session in dev env, then flip LLM_REASONING_REPLAY=0 and confirm flat-text path still works

Worth flagging

  1. Token cost grows by ~200-300 tokens per accumulated item with replay. The net benefit is reasoning continuity, not cost. Pair replay with ZDR/encrypted_content for a cost win.
  2. DanaCodingAgent doesn't auto-resume from disk. Pre-existing behavior, not introduced here, but consumers of replay will hit it. Future ergonomics PR could add resume_from_session_id.
  3. Long sessions grow timeline.json because reasoning items persist alongside summary text. Compression drops them at the summary boundary, but steady-state size still increases. Monitor .dana/**/timeline.json size in prod.
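
For the size monitoring in item 3, a small script along these lines may be enough. The .dana/**/timeline.json glob comes from the text above; the 5 MB threshold and the function names are arbitrary placeholders, not project policy.

```python
from pathlib import Path
from typing import Union


def timeline_sizes(root: Union[str, Path] = ".dana") -> list[tuple[int, Path]]:
    """Return (size in bytes, path) for every timeline.json under root, largest first."""
    found = [(p.stat().st_size, p) for p in Path(root).glob("**/timeline.json")]
    return sorted(found, key=lambda row: row[0], reverse=True)


def oversized(root: Union[str, Path] = ".dana", limit_bytes: int = 5_000_000) -> list[Path]:
    """Paths whose size exceeds limit_bytes (placeholder threshold)."""
    return [path for size, path in timeline_sizes(root) if size > limit_bytes]
```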

…es 1-4)

Persist raw reasoning items from the OpenAI Responses API in
TimelineEntry.metadata; on subsequent calls splice them back into
input[] when provider:model:endpoint matches, so gpt-5+ retains
structured reasoning state across turns instead of re-deriving it
from flattened assistant text. Reduces token churn and improves
reasoning continuity. Native-tool codec only.

Provider capture (Phase 1):
  - LLMResponse.reasoning_items + response_id (raw output[] items)
  - _chat_via_responses opts in to include=reasoning.encrypted_content
    with sticky one-shot 400 fallback for accounts/api-versions that
    reject the flag
  - endpoint_hash + fingerprint properties on Azure / OpenAI providers
    (sha256 of base_url, first 8 chars)

Agent persistence (Phase 2):
  - ParsedResponse carries reasoning_items + response_id; both codecs
    populate them
  - star_agent _record_think_results writes
    {reasoning_items, fingerprint, response_id} into AGENT_THOUGHTS
    metadata on the reasoning entry only (not the response/tool_call
    entries)

Codec replay (Phase 3):
  - LLMMessage gains reasoning_items / reasoning_fingerprint /
    response_id carriers
  - Timeline.to_llm_messages propagates entry metadata onto assistant
    messages; merge step skips when items present so 1:1 fingerprint
    provenance is preserved
  - prepare_messages threads carriers via underscore-prefixed keys
  - _convert_to_responses_input splices raw items before assistant
    message when fingerprint matches the active provider; carriers
    always stripped before API call
  - LLM_REASONING_REPLAY=0 kill switch (default ON, env-resolved
    per call)

Compression policy (Phase 4):
  - NativeMessage.to_llm_message propagates reasoning fields - closes
    a Phase 3 gap where CompressedTimeline-routed traffic silently
    dropped them
  - compression_engine.compress composes summary metadata from a single
    explicit dict so reasoning items never survive the boundary

Tests: 49 new unit tests across capture, agent persistence, codec
replay (match/mismatch/kill switch/tool-call ordering/multi-turn),
and compression policy. 422/422 across LLM + core + compression
suites green.

…ale-id fallback

Surfaced by Phase 5 live verify (Azure gpt-5.2): replaying reasoning
items captured via model_dump() caused HTTP 400 because output-side
fields like 'status' aren't valid on input. Without a fallback, the
rejection silently killed every subsequent turn (user_message saved,
no agent_response).

Two fixes:

1. _serialize_output_item now whitelists input-side fields only:
   type, id, summary, encrypted_content. No more model_dump() leak of
   output-only metadata into replayed input[].

2. _chat_via_responses gains a one-shot stale-id / item-shape rejection
   fallback. On BadRequestError matching the rejection heuristic, strip
   reasoning items from input[] and retry once. Degrades to flat-text
   replay rather than failing the user's turn entirely.
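
The one-shot strip-and-retry in fix 2 can be sketched as below. BadRequest stands in for the SDK's BadRequestError, and the rejection heuristic (matching "unknown_parameter" or "not found" in the error text) is an assumption; the commit does not spell out the real heuristic.

```python
from typing import Any, Callable


class BadRequest(Exception):
    """Stand-in for the SDK's bad-request error."""


def looks_like_item_rejection(err: Exception) -> bool:
    # Assumed heuristic: item-shape errors mention unknown_parameter,
    # stale ids surface as "not found".
    text = str(err).lower()
    return "unknown_parameter" in text or "not found" in text


def call_with_replay_fallback(
    send: Callable[[list[dict[str, Any]]], Any],
    input_items: list[dict[str, Any]],
) -> Any:
    """Try with reasoning items; on a matching rejection, strip them and retry once."""
    try:
        return send(input_items)
    except BadRequest as err:
        has_reasoning = any(item.get("type") == "reasoning" for item in input_items)
        if not (has_reasoning and looks_like_item_rejection(err)):
            raise
        stripped = [item for item in input_items if item.get("type") != "reasoning"]
        return send(stripped)  # single retry; a second failure propagates to the caller
```

Unrelated 400s are re-raised untouched, so the fallback cannot mask other request bugs.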

Verify script (scripts/verify-reasoning-replay.py) confirms:
  - replay fires when ON (3 fires across turns 2-4)
  - kill switch (LLM_REASONING_REPLAY=0) cleanly disables replay
  - reasoning_items accumulate cumulatively turn-over-turn
  - all 4 turns complete cleanly with replay enabled
  - 329/329 unit tests still green

Findings recorded in plans/.../verify-260507-1905-reasoning-replay.md:
replay does NOT save tokens in steady-state - model reasons more
when prior state is present (~200-300 tokens per accumulated item).
Real benefit is reasoning continuity, not cost. Cost benefit
materializes when paired with ZDR/encrypted_content (smaller payload).

User flagged the gap: Phase 5 verified in-process multi-turn replay,
but did not exercise the fresh-process disk-load → resume → replay
path. This commit closes it with both unit tests and a live verify.

Findings:

1. The replay machinery works end-to-end across a process boundary.
   Persisted timeline.json round-trips through the legacy + native
   load formats, metadata survives, items splice into Responses API
   input[] when fingerprint matches.

2. DanaCodingAgent does NOT auto-resume from disk - each new process
   gets a fresh empty session. Resume requires explicit re-injection
   of persisted entries via agent._timeline.load_from_entries(...).
   Documented in the verify report.

Tests (4):
  - Legacy-format disk resume → splice fires with persisted items
  - Native-format disk resume → NativeMessage.from_dict preserves
    metadata.reasoning_items end-to-end
  - Cross-provider fingerprint mismatch → no replay (carriers stripped)
  - Kill switch overrides matching fingerprint after resume

Live verify (scripts/verify-disk-resume-replay.py): fresh process
loaded a 3-item timeline, replayed all 3 into input[] on the next
turn (1 splice fire), LLM call completed.

333/333 unit tests pass.
@ngoclam9415 ngoclam9415 force-pushed the feat/reasoning-state-replay branch from 2f11174 to b9a9a79 Compare May 10, 2026 14:24
@ngoclam9415 ngoclam9415 merged commit 799959f into develop May 10, 2026
1 check failed
@TheVinhLuong102 TheVinhLuong102 deleted the feat/reasoning-state-replay branch May 21, 2026 22:02