fix(timeline,tools): GPT-5 timeline robustness — reasoning-item drop + tool-batch isolation #11

Merged: ngoclam9415 merged 2 commits into develop from fix/reasoning-item-empty-summary-drop on May 19, 2026

ngoclam9415 commented on May 18, 2026

Two independent robustness fixes for GPT-5/o3/o4 timelines. Both surfaced while verifying a real session timeline (atlas-q1).


Commit 1 — persist reasoning items on empty-summary turns

Problem

GPT-5/o3/o4 turns return a reasoning item (rs_… + encrypted_content) with an empty summary on low-summary turns — typically single-call tool continuations. _record_think_results gated AGENT_THOUGHTS creation on summary text:

if reasoning and len(reasoning) > 0:   # keys on summary text

When the summary was empty, no entry was created — and reasoning_items (which rides in that entry's metadata) was silently dropped along with it. Observed: 7 of ~14 model turns in atlas-q1 were bare tool_call entries with no preceding agent_thoughts.

Impact: on resume, those turns replay with no reasoning state. Cross-turn reasoning replay (PR #10) silently does not apply to low-summary turns.

Fix

Gate now keys on reasoning_items presence, not summary text. Entry emitted with empty content when only the item is present; metadata still carries the item for replay. Applied to both branches.
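
A minimal sketch of the gate change, under stated assumptions. `TimelineEntry`, `record_thoughts`, and the `metadata["reasoning_items"]` layout are hypothetical stand-ins and not the actual `_record_think_results` signature. The sketch also assumes an entry is still emitted when only summary text is present, which the description above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TimelineEntry:
    # Hypothetical stand-in for an AGENT_THOUGHTS timeline entry.
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def record_thoughts(
    summary: Optional[str],
    reasoning_items: Optional[list[dict[str, Any]]],
) -> Optional[TimelineEntry]:
    """Emit an entry when summary text or reasoning items are present."""
    has_summary = bool(summary)
    has_items = bool(reasoning_items)
    # The old gate checked only the summary text, so an empty summary
    # skipped the entry and dropped the reasoning items with it.
    if not (has_summary or has_items):
        return None
    metadata: dict[str, Any] = {}
    if has_items:
        # The items ride in metadata so they can be replayed on resume.
        metadata["reasoning_items"] = reasoning_items
    return TimelineEntry(content=summary or "", metadata=metadata)
```

With this shape, a turn that carries only an `rs_…` item yields an entry with empty `content` and the item preserved in `metadata`, while a turn with neither summary nor items yields no entry.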


Commit 2 — isolate tool-call failures within a batch

Problem

A dispatch-phase exception (registry getattr, name parsing, object lookup) escaped before the inner try in _execute_single_call / _execute_single_call_async. In the async path it propagated out of asyncio.gather, discarding the entire batch's results — including calls that succeeded.

The TOOL_CALL entry already recorded N tool_call_ids, so the next OpenAI turn 400s on the unanswered tool calls.
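
The failure mode can be reproduced in isolation. The sketch below is illustrative only: `ok_call`, `bad_dispatch`, and `run_batch_unguarded` are hypothetical names, and the `AttributeError` merely stands in for a dispatch-phase failure such as a registry lookup error.

```python
import asyncio


async def ok_call(call_id: str) -> dict:
    # A tool call that completes normally.
    return {"tool_call_id": call_id, "output": "ok"}


async def bad_dispatch(call_id: str) -> dict:
    # Stands in for an exception raised before any inner try block is reached.
    raise AttributeError(f"no tool registered for {call_id}")


async def run_batch_unguarded() -> list:
    # With the default return_exceptions=False, the first exception
    # propagates to the awaiter and the sibling's result is never returned.
    return await asyncio.gather(ok_call("call_1"), bad_dispatch("call_2"))


def demo() -> str:
    try:
        asyncio.run(run_batch_unguarded())
    except AttributeError:
        return "batch lost"
    return "batch kept"
```

Here `call_1` succeeds, but the caller only ever sees the exception from `call_2`, so neither `tool_call_id` receives a result.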

Fix

  • Each single-call dispatcher wrapped in one outer guard covering dispatch + execution; both now non-raising. Redundant inner try blocks removed.
  • execute_tools_async: asyncio.gather(return_exceptions=True) + escaped-exception → error result (defense-in-depth).

Every tool_call_id in a batch now always gets a result.
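
A rough sketch of the two layers, assuming a simplified call shape. `REGISTRY`, `error_result`, `execute_single_call`, `execute_tools`, and the dict layout of calls and results are hypothetical and not the project's real `_execute_single_call_async` / `execute_tools_async` code. Only `asyncio.gather(..., return_exceptions=True)` is taken directly from the description above.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[..., Awaitable[Any]]

# Hypothetical registry; the real one is not shown in this PR.
REGISTRY: dict[str, ToolFn] = {}


async def add(a: int, b: int) -> int:
    # Example tool used only for illustration.
    return a + b


REGISTRY["add"] = add


def error_result(call_id: str, exc: BaseException) -> dict[str, Any]:
    return {"tool_call_id": call_id, "error": f"{type(exc).__name__}: {exc}"}


async def execute_single_call(call: dict[str, Any]) -> dict[str, Any]:
    """Non-raising: one outer guard covers dispatch and execution."""
    call_id = call.get("id", "unknown")
    try:
        tool = REGISTRY[call["name"]]  # dispatch can fail here
        output = await tool(**call.get("arguments", {}))  # or here
        return {"tool_call_id": call_id, "output": output}
    except Exception as exc:
        return error_result(call_id, exc)


async def execute_tools(calls: list[dict[str, Any]]) -> list[dict[str, Any]]:
    raw = await asyncio.gather(
        *(execute_single_call(c) for c in calls), return_exceptions=True
    )
    # Defense in depth: anything that still escaped becomes an error result,
    # so results stay aligned one-to-one with the submitted calls.
    return [
        error_result(c.get("id", "unknown"), r) if isinstance(r, BaseException) else r
        for c, r in zip(calls, raw)
    ]
```

Because `asyncio.gather` returns results in the order the awaitables were passed, zipping them with `calls` keeps each result paired with its `tool_call_id` even when a sibling fails.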


Tests

  • TestEmptySummaryReasoningPersistence — tool-call / direct-answer branches + no-items negative case.
  • Batch-isolation tests — async + sync paths: failing call isolated, siblings succeed, tool_call_ids preserved.
  • Followed RED→GREEN for both.
  • test_thinking_metadata_persistence.py 14/14, test_tool_executor_parallel.py 9/9, tests/unit/core/ + tests/regression/ 202 passed / 18 skipped.
  • 2 pre-existing failures in TestEndpointHashAndFingerprint (test pollution, fail on clean tree too) — out of scope.

Notes / not in scope

  • Mixed async/sync tool batches still degrade (sync tool blocks the event loop). That's a separate design item — needs tool concurrency classification — deliberately not bundled here.
  • Existing broken timelines (e.g. atlas-q1) stay unrecoverable; commit 1 fixes capture going forward only.

GPT-5/o3/o4 turns return a reasoning item (rs_… + encrypted_content)
with an empty summary on low-summary turns — typically single-call
tool continuations. The AGENT_THOUGHTS gate in _record_think_results
keyed on summary text, so these turns produced no timeline entry and
the encrypted reasoning item was silently dropped.

On resume the affected turns replay with no reasoning state, breaking
cross-turn reasoning continuity for GPT-5/o3/o4.

Gate now keys on reasoning_items presence, not summary text. Entry is
emitted with empty content when only the item is present; metadata
still carries the item for replay.

Add TestEmptySummaryReasoningPersistence covering tool-call and
direct-answer branches plus the no-items negative case.

A dispatch-phase exception (registry getattr, name parsing, object
lookup) escaped before the inner try block in _execute_single_call and
_execute_single_call_async. In the async path it propagated out of
asyncio.gather, discarding the entire batch's results — including calls
that succeeded.

The TOOL_CALL entry still recorded N tool_call_ids, so the next OpenAI
turn 400s on the unanswered tool calls.

- Wrap each single-call dispatcher in one outer guard covering dispatch
  and execution; both are now non-raising. Removes the redundant inner
  try blocks.
- execute_tools_async: asyncio.gather(return_exceptions=True) and
  convert any escaped exception to an error result — defense-in-depth.

Every tool_call_id in a batch now always gets a result.

Add isolation tests covering async and sync paths: a failing call
yields an isolated error result, siblings succeed, tool_call_ids
preserved.
ngoclam9415 changed the title from "fix(timeline): persist reasoning items on empty-summary turns" to "fix(timeline,tools): GPT-5 timeline robustness — reasoning-item drop + tool-batch isolation" on May 18, 2026
ngoclam9415 merged commit 1e31c41 into develop on May 19, 2026
1 check failed
TheVinhLuong102 deleted the fix/reasoning-item-empty-summary-drop branch on May 21, 2026 22:01