fix(timeline,tools): GPT-5 timeline robustness - reasoning-item drop + tool-batch isolation #11
Merged
GPT-5/o3/o4 turns return a reasoning item (rs_… + encrypted_content) with an empty summary on low-summary turns — typically single-call tool continuations. The AGENT_THOUGHTS gate in _record_think_results keyed on summary text, so these turns produced no timeline entry and the encrypted reasoning item was silently dropped. On resume the affected turns replay with no reasoning state, breaking cross-turn reasoning continuity for GPT-5/o3/o4. Gate now keys on reasoning_items presence, not summary text. Entry is emitted with empty content when only the item is present; metadata still carries the item for replay. Add TestEmptySummaryReasoningPersistence covering tool-call and direct-answer branches plus the no-items negative case.
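The change of gate can be illustrated with a minimal sketch. The helper name build_thought_entry, the entry layout, and the sample item id are hypothetical (the real _record_think_results is not shown in this PR text), and keeping summary-only entries is an assumption of the sketch rather than something the PR states.

```python
from typing import Any, Optional


def build_thought_entry(
    summary_text: str,
    reasoning_items: list[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Return an AGENT_THOUGHTS-style entry, or None if nothing needs persisting.

    Hypothetical stand-in for the gate inside _record_think_results.
    """
    # Old gate, for contrast: `if not summary_text: return None`.
    # That also threw away reasoning_items, which ride in the entry's metadata.
    if not reasoning_items and not summary_text:
        return None  # negative case: no items and no summary, so no entry

    return {
        "type": "agent_thoughts",
        # Content may be empty when only the encrypted item is present.
        "content": summary_text or "",
        # The items are kept so a resumed session can replay them.
        "metadata": {"reasoning_items": list(reasoning_items)},
    }


if __name__ == "__main__":
    # Hypothetical reasoning item with an empty summary.
    item = {"id": "rs_example", "encrypted_content": "opaque", "summary": []}
    print(build_thought_entry("", [item]))
    print(build_thought_entry("", []))
```

With a gate of this shape, an empty-summary turn still produces an entry whose metadata carries the item, while a turn with neither summary nor items produces none.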
A dispatch-phase exception (registry getattr, name parsing, object lookup) escaped before the inner try block in _execute_single_call and _execute_single_call_async. In the async path it propagated out of asyncio.gather, discarding the entire batch's results, including calls that succeeded. The TOOL_CALL entry still recorded N tool_call_ids, so the next OpenAI turn 400s on the unanswered tool calls.
- Wrap each single-call dispatcher in one outer guard covering dispatch and execution; both are now non-raising. Removes the redundant inner try blocks.
- execute_tools_async: asyncio.gather(return_exceptions=True) and convert any escaped exception to an error result (defense-in-depth).
Every tool_call_id in a batch now always gets a result. Add isolation tests covering async and sync paths: a failing call yields an isolated error result, siblings succeed, tool_call_ids preserved.
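The two layers of isolation can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: run_single_call, run_batch, error_result, the dict-based registry, and the call/result shapes are all hypothetical; only asyncio.gather(return_exceptions=True) is taken from the description.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[[dict[str, Any]], Awaitable[Any]]


def error_result(call_id: str, exc: BaseException) -> dict[str, Any]:
    """Build an error result that still answers the given tool_call_id."""
    return {
        "tool_call_id": call_id,
        "ok": False,
        "error": f"{type(exc).__name__}: {exc}",
    }


async def run_single_call(
    registry: dict[str, ToolFn], call: dict[str, Any]
) -> dict[str, Any]:
    """Non-raising dispatcher: one outer guard covers dispatch and execution."""
    call_id = call.get("id", "")
    try:
        tool = registry[call["name"]]  # dispatch phase: lookup may raise
        output = await tool(call.get("arguments", {}))  # execution phase
        return {"tool_call_id": call_id, "ok": True, "output": output}
    except Exception as exc:
        return error_result(call_id, exc)


async def run_batch(
    registry: dict[str, ToolFn], calls: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return exactly one result per call, in call order."""
    outcomes = await asyncio.gather(
        *(run_single_call(registry, call) for call in calls),
        return_exceptions=True,  # defense-in-depth: nothing escapes the batch
    )
    results = []
    for call, outcome in zip(calls, outcomes):
        if isinstance(outcome, BaseException):
            # An exception escaped the guard anyway; convert it to a result.
            results.append(error_result(call.get("id", ""), outcome))
        else:
            results.append(outcome)
    return results


if __name__ == "__main__":
    async def add(args: dict[str, Any]) -> int:
        return args["a"] + args["b"]

    demo_calls = [
        {"id": "call_1", "name": "add", "arguments": {"a": 1, "b": 2}},
        {"id": "call_2", "name": "missing_tool", "arguments": {}},
    ]
    for result in asyncio.run(run_batch({"add": add}, demo_calls)):
        print(result)
```

In this sketch the lookup failure for the unregistered tool becomes an error result for its own tool_call_id, and the sibling call still returns its output, so the batch never leaves a recorded call unanswered.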
Two independent robustness fixes for GPT-5/o3/o4 timelines. Both surfaced while verifying a real session timeline (atlas-q1).
Commit 1: persist reasoning items on empty-summary turns
Problem
GPT-5/o3/o4 turns return a reasoning item (rs_… + encrypted_content) with an empty summary on low-summary turns, typically single-call tool continuations. _record_think_results gated AGENT_THOUGHTS creation on summary text. When the summary was empty, no entry was created, and reasoning_items (which rides in that entry's metadata) was silently dropped along with it. Observed: 7 of ~14 model turns in atlas-q1 were bare tool_call entries with no preceding agent_thoughts.
Impact: on resume, those turns replay with no reasoning state. Cross-turn reasoning replay (PR #10) silently does not apply to low-summary turns.
Fix
Gate now keys on reasoning_items presence, not summary text. Entry emitted with empty content when only the item is present; metadata still carries the item for replay. Applied to both branches (tool-call and direct-answer).
Commit 2: isolate tool-call failures within a batch
Problem
A dispatch-phase exception (registry getattr, name parsing, object lookup) escaped before the inner try in _execute_single_call / _execute_single_call_async. In the async path it propagated out of asyncio.gather, discarding the entire batch's results, including calls that succeeded.
The TOOL_CALL entry already recorded N tool_call_ids, so the next OpenAI turn 400s on the unanswered tool calls.
Fix
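- _execute_single_call / _execute_single_call_async: each is wrapped in one outer guard covering dispatch and execution, so both are now non-raising; redundant inner try blocks removed.
- execute_tools_async: asyncio.gather(return_exceptions=True) + escaped-exception → error result (defense-in-depth).
Every tool_call_id in a batch now always gets a result.
Tests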
- TestEmptySummaryReasoningPersistence: tool-call / direct-answer branches + no-items negative case.
- Isolation tests for the async and sync paths: a failing call yields an isolated error result, siblings succeed, tool_call_ids preserved.
- test_thinking_metadata_persistence.py 14/14, test_tool_executor_parallel.py 9/9, tests/unit/core/ + tests/regression/ 202 passed / 18 skipped.
- Failures in TestEndpointHashAndFingerprint (test pollution; they fail on a clean tree too) are out of scope.
Notes / not in scope
- Timelines recorded before this change (e.g. atlas-q1) stay unrecoverable; commit 1 fixes capture going forward only.