fix: strip reasoning items from Responses API input when store=False #2178
mr-karan wants to merge 4 commits into OpenHands:main from a fork
Conversation
When store=False (the default for agent calls), OpenAI doesn't persist conversation items. The SDK serializes reasoning item IDs from previous turns back into the input, but OpenAI can't resolve them, causing:

> Item with id 'rs_...' not found. Items are not persisted when `store` is set to false.

This is reproducible with any multi-turn conversation using reasoning models (gpt-5.x, codex, o3, o4) via the Responses API.

Fix: filter out `type: "reasoning"` items from the input list in LLM.responses() when the effective store value is False. The model reasons fresh each turn from the full conversation context (user messages, tool calls, tool results), so omitting echoed reasoning items from prior turns doesn't affect output quality.
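A minimal sketch of that filter, assuming plain-dict input items (`strip_reasoning_items` is a hypothetical helper name; the actual change lives inside `LLM.responses()`):

```python
def strip_reasoning_items(input_items, store):
    """Drop echoed reasoning items when OpenAI cannot resolve them.

    With store=False nothing is persisted server-side, so replayed
    reasoning item IDs ('rs_...') from earlier turns trigger a 404.
    """
    if store:
        return input_items
    return [item for item in input_items if item.get("type") != "reasoning"]
```

Everything else (user messages, tool calls, tool results) passes through unchanged, which is why the model can still reason fresh each turn.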
The OpenHands SDK replays reasoning item IDs on subsequent turns, but with store=False (the default) OpenAI can't resolve them, breaking all multi-turn conversations with gpt-5.x/codex/o3/o4 models. Fix: strip reasoning items at parse time (from_llm_responses_output) so they're never stored on the Message and never replayed. The model reasons fresh each turn from the full conversation context.

- Upgrade openhands-sdk/tools from v1.8.1 to v1.11.5
- Patch Message.from_llm_responses_output to discard reasoning items
- Remove fragile to_responses_dict and dual-import patches
- Add 3 regression tests for the reasoning strip behaviour
- Fix conftest.py stubs shadowing real SDK in test environment
- Upstream fix: OpenHands/software-agent-sdk#2178

Bump version to 0.2.12.
enyst left a comment
Thank you for the contribution! According to OpenAI docs, I think the behavior described in the PR description is not accurate: the reasoning items are available and can be part of a conversation with store=False.
But I've seen this before 🤔
Could you tell us how you ran the SDK?
Hi @enyst, thanks for the review! I hit this while running hodor (a PR review agent built on top of the OpenHands SDK). The agent loop naturally produces multi-turn conversations — prompt → bash tool calls → observations → next turn. With any reasoning model, the second turn fails with the 404 above. I think the docs may be describing the behavior when `store=True`.

Here's a minimal repro script (no SDK dependency, just the raw OpenAI client) that demonstrates the issue:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["openai>=1.11.0"]
# ///
"""
Usage:
    export OPENAI_API_KEY=sk-...
    uv run repro_reasoning_store_false.py
"""
from openai import OpenAI

client = OpenAI()
model = "o4-mini"  # any reasoning model

# Turn 1: succeeds
resp = client.responses.create(
    model=model,
    input="What is 2+2? Be brief.",
    store=False,
)
prior_items = [item.to_dict() for item in resp.output]

# Turn 2: fails — reasoning item IDs can't be resolved
resp2 = client.responses.create(
    model=model,
    input=prior_items + [{"role": "user", "content": "Now what is 3+3?"}],
    store=False,
)
```

Output: Turn 1 succeeds, turn 2 fails with a 404; stripping the `reasoning` items from `prior_items` makes turn 2 succeed.
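The fix applied to this repro is just a list filter over `prior_items` before the second call; the data-level part can be shown without an API key (the item shapes below are illustrative, mimicking `resp.output` after `to_dict()`):

```python
# Illustrative items shaped like resp.output after to_dict(): a reasoning
# item (whose 'rs_...' id turn 2 cannot resolve) plus the assistant message.
prior_items = [
    {"type": "reasoning", "id": "rs_abc123", "summary": []},
    {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": "2 + 2 = 4"}],
    },
]

# Strip the reasoning item, then append the next user turn as before.
turn2_input = [item for item in prior_items if item.get("type") != "reasoning"]
turn2_input.append({"role": "user", "content": "Now what is 3+3?"})

assert all(item.get("type") != "reasoning" for item in turn2_input)
```

With `turn2_input` built this way, the second `client.responses.create(...)` call no longer references unresolvable item IDs.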
@OpenHands Read all comments in this PR. Test the script, using the OpenAI key, and try multiple models. Read https://developers.openai.com/api/reference/resources/responses/methods/create and investigate the docs. Note that we only care about store=False. Commit the files / test results in the .pr directory, it's a directory just for testing. Do not modify the PR otherwise. Note that it's from a fork so make sure you set the remote correctly.
I'm on it! enyst can track my progress at all-hands.dev
Co-authored-by: openhands <openhands@all-hands.dev>
Final summary

- New changes made
- Verification / instructions checklist
- Conciseness check
- Pushed update
Co-authored-by: openhands <openhands@all-hands.dev>
(OpenHands-GPT-5.2 speaking) I tested the repro script with the OpenAI key across multiple models.

1) What worked with encrypted reasoning included

I updated the repro script to optionally request encrypted reasoning: `uv run python .pr/repro_responses_reasoning_store_false.py --include-encrypted`

That sets `include=["reasoning.encrypted_content"]` on both calls. Results (Responses API, `store=False`):
The corresponding output artifact is:
Baseline (no include) still reproduces the 404 as before:
I also updated the report:
Pushed in commit:

2) Does the SDK set this already?

Yes — in the current repo state, the SDK already has support for this and it’s enabled by default:
So the SDK should already be requesting encrypted reasoning on stateless Responses calls, unless someone explicitly disables it. If you want, I can also add one more targeted experiment in the .pr directory.
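For reference, a sketch of the request parameters for a stateless call that still round-trips reasoning; only the payload is built here, no API call is made (`build_stateless_request` is a hypothetical helper, not SDK code):

```python
def build_stateless_request(model, input_items):
    # store=False: nothing is persisted server-side, so the reasoning must
    # travel with the request. The include option asks OpenAI to return an
    # encrypted copy of the reasoning that can be replayed verbatim on the
    # next turn even though no items are stored.
    return {
        "model": model,
        "input": input_items,
        "store": False,
        "include": ["reasoning.encrypted_content"],
    }
```

Passing this payload to `client.responses.create(**req)` is what the `--include-encrypted` flag above exercises.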
HUMAN: @mr-karan I think we see what's happening: you are entirely correct, however the SDK should have asked for encrypted reasoning. I think we have two alternatives:
Co-authored-by: openhands <openhands@all-hands.dev>
(OpenHands-GPT-5.2 speaking)

1) “Real use” in the SDK (Responses API via LiteLLM)

I added an SDK-level repro that exercises the actual OpenHands call path:
Artifacts:
What I observed

For reasoning models (gpt-5.x/codex/o3/o4):
Does the SDK set it?

Yes:
(Details + file pointers are also written up in the .pr writeup.)

2) hodor: why did he see the error even though it’s built on the SDK?

I cloned the hodor repo. Key finding: hodor intentionally disables encrypted reasoning:
It also contains a monkeypatch to drop reasoning items:
Where I recorded everything / pushed

I committed and pushed only files under the .pr directory.
Main writeup:
If you want to adjust the PR’s fix based on this: one nuance is that unconditionally stripping reasoning items when `store=False` would also discard encrypted reasoning content that the API can resolve.
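That nuance could be handled by stripping conditionally; a sketch, assuming plain-dict items (`should_strip_reasoning` is a hypothetical helper, not the PR's current code):

```python
def should_strip_reasoning(item, store):
    # Keep reasoning items that carry an encrypted payload: with
    # include=["reasoning.encrypted_content"] the API can resolve those
    # even when store=False. Only bare, ID-only echoes are unsafe.
    return (
        not store
        and item.get("type") == "reasoning"
        and not item.get("encrypted_content")
    )
```

Under this predicate, the filter only removes reasoning items that would actually trigger the 404, leaving encrypted reasoning free to round-trip.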
HUMAN: (we kinda have to say it around here 😅 )
I agree it's a bug that the SDK fails if you disable encrypted reasoning.
Or here:
Summary
When `store=False` (the default in `agent/utils.py:221`), OpenAI doesn't persist conversation items. The SDK serializes reasoning item IDs from previous turns back into the input via `Message.to_responses_dict()`, but OpenAI can't resolve them:

> Item with id 'rs_...' not found. Items are not persisted when `store` is set to false.

This is reproducible with any multi-turn conversation using reasoning models (`gpt-5.x`, `codex`, `o3`, `o4`) via the Responses API.

Fix: Filter out `type: "reasoning"` items from the input list in `LLM.responses()` when the effective `store` value is `False`. The model reasons fresh each turn from the full conversation context (user messages, tool calls, tool results), so omitting echoed reasoning items from prior turns doesn't affect output quality.

Reproduction

Verified with `gpt-5.1-codex-mini`.

Related

#1754 (the `store=True` case — "reasoning item without its required following item"). These are complementary fixes: this PR handles `store=False`; fix: keep message id/status for Responses API round-trip #1754 handles the `store=True` serialization format.

Test plan

- `test_reasoning_items_stripped_from_input_when_store_false`
- Verified with `gpt-5.1-codex-mini` and `gpt-5.2` (multi-turn agent conversations complete successfully)