feat(llm,timeline): cross-turn reasoning state replay for gpt-5/o3/o4 #10
Merged
…es 1-4)
Persist raw reasoning items from the OpenAI Responses API in
TimelineEntry.metadata; on subsequent calls splice them back into
input[] when provider:model:endpoint matches, so gpt-5+ retains
structured reasoning state across turns instead of re-deriving it
from flattened assistant text. Reduces token churn and improves
reasoning continuity. Native-tool codec only.
Provider capture (Phase 1):
- LLMResponse.reasoning_items + response_id (raw output[] items)
- _chat_via_responses opts in to include=reasoning.encrypted_content
with sticky one-shot 400 fallback for accounts/api-versions that
reject the flag
- endpoint_hash + fingerprint properties on Azure / OpenAI providers
(sha256 of base_url, first 8 chars)
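A minimal sketch of the endpoint hash and fingerprint described above. The function names, the absence of URL normalization, and the exact colon-joined layout are assumptions made for illustration; the commit only states "sha256 of base_url, first 8 chars" and a provider:model:endpoint match.

```python
import hashlib


def endpoint_hash(base_url: str) -> str:
    """First 8 hex chars of the SHA-256 digest of the base URL."""
    return hashlib.sha256(base_url.encode("utf-8")).hexdigest()[:8]


def fingerprint(provider: str, model: str, base_url: str) -> str:
    """Join provider, model, and endpoint hash (layout is an assumption)."""
    return f"{provider}:{model}:{endpoint_hash(base_url)}"


# Hypothetical endpoint, used only to show the shape of the result.
fp = fingerprint("azure", "gpt-5.2", "https://example.openai.azure.com")
```

Hashing the base URL rather than storing it keeps the endpoint out of persisted metadata while still letting replay detect a provider or endpoint change.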
Agent persistence (Phase 2):
- ParsedResponse carries reasoning_items + response_id; both codecs
populate them
- star_agent _record_think_results writes
{reasoning_items, fingerprint, response_id} into AGENT_THOUGHTS
metadata on the reasoning entry only (not the response/tool_call
entries)
Codec replay (Phase 3):
- LLMMessage gains reasoning_items / reasoning_fingerprint /
response_id carriers
- Timeline.to_llm_messages propagates entry metadata onto assistant
messages; merge step skips when items present so 1:1 fingerprint
provenance is preserved
- prepare_messages threads carriers via underscore-prefixed keys
- _convert_to_responses_input splices raw items before assistant
message when fingerprint matches the active provider; carriers
always stripped before API call
- LLM_REASONING_REPLAY=0 kill switch (default ON, env-resolved
per call)
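The splice and kill-switch behaviour in Phase 3 can be sketched roughly as follows. This is a simplified stand-in, not the real _convert_to_responses_input: the carrier key names, the set of falsy kill-switch values, and the message shape are all assumptions.

```python
import os
from typing import Any

# Assumed names for the underscore-prefixed carrier keys.
CARRIER_KEYS = ("_reasoning_items", "_reasoning_fingerprint", "_response_id")


def replay_enabled() -> bool:
    """Resolve the kill switch from the environment on every call (default ON)."""
    value = os.environ.get("LLM_REASONING_REPLAY", "1").strip().lower()
    # Assumption: which values count as "off" beyond "0".
    return value not in {"0", "false", "no", "off"}


def convert_to_responses_input(
    messages: list[dict[str, Any]], active_fingerprint: str
) -> list[dict[str, Any]]:
    """Splice raw reasoning items before matching assistant messages."""
    items: list[dict[str, Any]] = []
    for message in messages:
        reasoning = message.get("_reasoning_items") or []
        matches = message.get("_reasoning_fingerprint") == active_fingerprint
        if message.get("role") == "assistant" and reasoning and matches and replay_enabled():
            items.extend(reasoning)
        # Carriers are always stripped, whether or not the splice fired.
        items.append({k: v for k, v in message.items() if k not in CARRIER_KEYS})
    return items
```

Reading the environment variable inside the function, rather than at import time, is what makes the switch effective per call.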
Compression policy (Phase 4):
- NativeMessage.to_llm_message propagates reasoning fields - closes
a Phase 3 gap where CompressedTimeline-routed traffic silently
dropped them
- compression_engine.compress composes summary metadata from a single
explicit dict so reasoning items never survive the boundary
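One way to read the "single explicit dict" policy: summary metadata is built from named fields only and never merged from the entries being compressed, so reasoning items cannot leak across the boundary. The function name and the metadata keys below are hypothetical.

```python
from typing import Any


def build_summary_metadata(compressed: list[dict[str, Any]]) -> dict[str, Any]:
    """Compose summary metadata from explicitly listed fields only."""
    # No **entry["metadata"] merge here: anything not named is dropped,
    # including reasoning_items, fingerprint, and response_id.
    return {
        "summarized_count": len(compressed),
        "source_ids": [entry["id"] for entry in compressed],
    }
```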
Tests: 49 new unit tests across capture, agent persistence, codec
replay (match/mismatch/kill switch/tool-call ordering/multi-turn),
and compression policy. 422/422 across LLM + core + compression
suites green.
…ale-id fallback
Surfaced by Phase 5 live verify (Azure gpt-5.2): replaying reasoning
items captured via model_dump() caused HTTP 400 because output-side
fields like 'status' aren't valid on input. Without a fallback, the
rejection silently killed every subsequent turn (user_message saved,
no agent_response).
Two fixes:
1. _serialize_output_item now whitelists input-side fields only:
type, id, summary, encrypted_content. No more model_dump() leak of
output-only metadata into replayed input[].
2. _chat_via_responses gains a one-shot stale-id / item-shape
rejection fallback. On BadRequestError matching the rejection
heuristic, strip reasoning items from input[] and retry once.
Degrades to flat-text replay rather than failing the user's turn
entirely.
Verify script (scripts/verify-reasoning-replay.py) confirms:
- replay fires when ON (3 fires across turns 2-4)
- kill switch (LLM_REASONING_REPLAY=0) cleanly disables replay
- reasoning_items accumulate cumulatively turn-over-turn
- all 4 turns complete cleanly with replay enabled
- 329/329 unit tests still green
Findings recorded in plans/.../verify-260507-1905-reasoning-replay.md:
replay does NOT save tokens in steady-state - model reasons more when
prior state is present (~200-300 tokens per accumulated item). Real
benefit is reasoning continuity, not cost. Cost benefit materializes
when paired with ZDR/encrypted_content (smaller payload).
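The two fixes in this commit can be sketched as below. The whitelist is the one named in the commit message. The BadRequest class is a stand-in for the SDK's error type, and the sketch treats every 400 as a match because the actual rejection heuristic is not described here.

```python
from typing import Any, Callable

INPUT_SAFE_FIELDS = ("type", "id", "summary", "encrypted_content")


def serialize_output_item(item: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields that are valid when the item is replayed as input."""
    return {key: item[key] for key in INPUT_SAFE_FIELDS if key in item}


class BadRequest(Exception):
    """Stand-in for the provider SDK's HTTP 400 error (assumption)."""


def call_with_replay_fallback(
    send: Callable[[list[dict[str, Any]]], Any],
    input_items: list[dict[str, Any]],
) -> Any:
    """Send once with reasoning items; on rejection, retry once without them."""
    try:
        return send(input_items)
    except BadRequest:
        stripped = [i for i in input_items if i.get("type") != "reasoning"]
        if len(stripped) == len(input_items):
            raise  # nothing to strip, so the rejection has another cause
        return send(stripped)
```

The retry runs at most once, so a request that still fails after stripping surfaces its error to the caller instead of looping.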
User flagged the gap: Phase 5 verified in-process multi-turn replay,
but did not exercise the fresh-process disk-load → resume → replay
path. This commit closes it with both unit tests and a live verify.
Findings:
1. The replay machinery works end-to-end across a process boundary.
Persisted timeline.json round-trips through the legacy + native
load formats, metadata survives, items splice into Responses API
input[] when fingerprint matches.
2. DanaCodingAgent does NOT auto-resume from disk - each new process
gets a fresh empty session. Resume requires explicit re-injection
of persisted entries via agent._timeline.load_from_entries(...).
Documented in the verify report.
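A rough sketch of the explicit re-injection from finding 2. Only agent._timeline.load_from_entries(...) comes from this PR; the helper names, the on-disk layout of timeline.json (a bare list, or a dict with an "entries" list), and the argument type passed to load_from_entries are assumptions.

```python
import json
from pathlib import Path
from typing import Any


def load_persisted_entries(path: Path) -> list[dict[str, Any]]:
    """Read timeline entries from a persisted timeline.json file."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumed layouts: a bare list of entries, or {"entries": [...]}.
    return data if isinstance(data, list) else data.get("entries", [])


def resume_from_disk(agent: Any, path: Path) -> int:
    """Re-inject persisted entries into a fresh agent; returns the entry count."""
    entries = load_persisted_entries(path)
    agent._timeline.load_from_entries(entries)
    return len(entries)
```

After re-injection, the next turn goes through the normal replay path, so the fingerprint check and the kill switch still apply to the resumed entries.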
Tests (4):
- Legacy-format disk resume → splice fires with persisted items
- Native-format disk resume → NativeMessage.from_dict preserves
metadata.reasoning_items end-to-end
- Cross-provider fingerprint mismatch → no replay (carriers stripped)
- Kill switch overrides matching fingerprint after resume
Live verify (scripts/verify-disk-resume-replay.py): fresh process
loaded a 3-item timeline, replayed all 3 into input[] on the next
turn (1 splice fire), LLM call completed.
333/333 unit tests pass.
Summary
Persist raw reasoning items from the OpenAI Responses API in
TimelineEntry.metadata. On the next turn, splice them back into input[] when provider:model:endpoint matches, so gpt-5+ retains structured reasoning state across turns instead of re-deriving it from flattened assistant text.
Stacks on #9: the diff includes its 5 commits (Responses API plumbing, env-driven effort, agent reasoning persistence). When #9 merges, rebase this PR.
What changes
- LLMResponse: reasoning_items: list[dict] | None and response_id
- LLMMessage: reasoning_items / reasoning_fingerprint / response_id
- OpenAICompatibleProvider: include=["reasoning.encrypted_content"] with sticky 400 fallback; endpoint_hash + fingerprint properties
- _chat_via_responses
- _convert_to_responses_input
- LLM_REASONING_REPLAY=0 kill switch (default ON)
- Timeline.to_llm_messages
- STARAgent._record_think_results: {reasoning_items, fingerprint, response_id} into AGENT_THOUGHTS.metadata
- NativeMessage.to_llm_message
- compression_engine
Architecture
Counter-intuitive finding (live verify)
Replay does NOT save tokens in steady-state. When prior reasoning state is present, gpt-5 reasons every turn (~200-300 tokens per accumulated item). Without replay, the model nondeterministically skips reasoning, sometimes saving tokens but losing continuity.
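Real benefit: reasoning continuity, not cost. A cost benefit materializes when paired with ZDR/encrypted_content (smaller payload).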
Records:
plans/260507-1829-reasoning-state-replay/reports/
Critical bug found and fixed during verify
model_dump() was leaking output-only fields (status, content) into replayed input[]. Azure rejected with unknown_parameter: input[N].status, silently killing all subsequent turns. Now whitelisted to {type, id, summary, encrypted_content}, plus a stale-id rejection fallback for graceful degradation.
Operator runbook
Resume from disk (one ergonomics gotcha):
DanaCodingAgent doesn't auto-resume sessions. To continue a persisted conversation in a fresh process, re-inject the persisted entries via agent._timeline.load_from_entries(...).
Tests
333/333 unit tests green:
- tests/unit/llm/test_reasoning_items_capture.py (14): capture path, encrypted_content presence/absence, include-flag fallback, endpoint hash, fingerprint format
- tests/unit/core/agent/test_thinking_metadata_persistence.py (11): _build_thinking_metadata, _provider_fingerprint defensive lookup, both _record_think_results branches, JSON round-trip, isolation
- tests/unit/llm/test_reasoning_replay.py (17): splice match/mismatch, kill switch (incl. falsy values), tool-call ordering, multi-turn, carrier stripping, Timeline.to_llm_messages propagation, merge-skip
- tests/unit/test_reasoning_compression_policy.py (7): NativeMessage propagation, CompressedTimeline end-to-end, compressed-away entries vanish, kept entries retain items, no-keep edge case
- tests/unit/test_reasoning_replay_disk_resume.py (4): legacy + native format disk-resume, cross-provider mismatch, kill switch on resumed timeline
Live verify (Azure gpt-5.2):
- scripts/verify-reasoning-replay.py: 4-turn × 2-run (ON/OFF), splice fires confirmed, kill switch confirmed
- scripts/verify-disk-resume-replay.py: fresh process loads 3-item timeline, replays all 3 to input[]
Test plan
- uv run pytest tests/unit -q → all green
- uv run python scripts/verify-reasoning-replay.py → 4 hard PASS
- uv run python scripts/verify-disk-resume-replay.py → 3 hard PASS
- Set LLM_REASONING_REPLAY=0 and confirm flat-text path still works
Worth flagging
- resume_from_session_id.
- .dana/**/timeline.json size in prod.