fix: strip reasoning items from Responses API input when store=False#2178

Open
mr-karan wants to merge 4 commits into OpenHands:main from mr-karan:fix/strip-reasoning-items-store-false

Conversation

@mr-karan
Contributor

Summary

When store=False (the default in agent/utils.py:221), OpenAI doesn't persist conversation items. The SDK serializes reasoning item IDs from previous turns back into the input via Message.to_responses_dict(), but OpenAI can't resolve them:

"Item with id 'rs_...' not found. Items are not persisted when `store` is set to false.
Try again with `store` set to true, or remove this item from your input."

This is reproducible with any multi-turn conversation using reasoning models (gpt-5.x, codex, o3, o4) via the Responses API.

Fix: Filter out type: "reasoning" items from the input list in LLM.responses() when the effective store value is False. The model reasons fresh each turn from the full conversation context (user messages, tool calls, tool results), so omitting echoed reasoning items from prior turns doesn't affect output quality.
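The fix amounts to a one-line filter over the input list. A minimal sketch of the idea (the helper name and sample items are illustrative, not the actual SDK code):

```python
def strip_reasoning_items(input_items: list[dict]) -> list[dict]:
    """Drop Responses API items of type "reasoning" before sending
    input with store=False (sketch of the fix described above)."""
    return [item for item in input_items if item.get("type") != "reasoning"]

# A previous turn's output echoed back as the next turn's input:
items = [
    {"type": "reasoning", "id": "rs_abc123", "summary": []},
    {"type": "message", "role": "assistant", "content": "4"},
    {"role": "user", "content": "Now what is 3+3?"},
]
print(strip_reasoning_items(items))  # the rs_... item is gone
```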

Reproduction

  1. Use any reasoning model via the Responses API (e.g. gpt-5.1-codex-mini)
  2. Start a multi-turn conversation (agent loop with tool calls)
  3. First turn succeeds; second turn fails with the "Item not found" error

Related

Test plan

  • Added unit test: test_reasoning_items_stripped_from_input_when_store_false
  • All 627 existing LLM tests pass
  • Manually verified with gpt-5.1-codex-mini and gpt-5.2 (multi-turn agent conversations complete successfully)

mr-karan added a commit to mr-karan/hodor that referenced this pull request Feb 23, 2026
The OpenHands SDK replays reasoning item IDs on subsequent turns, but
with store=False (the default) OpenAI can't resolve them, breaking
all multi-turn conversations with gpt-5.x/codex/o3/o4 models.

Fix: strip reasoning items at parse time (from_llm_responses_output)
so they're never stored on the Message and never replayed. The model
reasons fresh each turn from the full conversation context.

- Upgrade openhands-sdk/tools from v1.8.1 to v1.11.5
- Patch Message.from_llm_responses_output to discard reasoning items
- Remove fragile to_responses_dict and dual-import patches
- Add 3 regression tests for the reasoning strip behaviour
- Fix conftest.py stubs shadowing real SDK in test environment
- Upstream fix: OpenHands/software-agent-sdk#2178

Bump version to 0.2.12.
Collaborator

@enyst enyst left a comment
Thank you for the contribution! According to OpenAI docs, I think the behavior described in the PR description may not be accurate: the reasoning items are available and can be part of a conversation with store=False

But I've seen this before 🤔

Could you tell how you ran the sdk?

@mr-karan
Contributor Author

Hi @enyst, thanks for the review!

I hit this while running hodor (a PR review agent built on top of the OpenHands SDK). The agent loop naturally produces multi-turn conversations — prompt → bash tool calls → observations → next turn.

With any reasoning model (o4-mini, gpt-5.1-codex-mini, gpt-5.2) via the Responses API, the first turn succeeds fine. On the second turn, Message.to_responses_dict() serializes reasoning item IDs (rs_...) from the previous turn back into the input. Since the SDK defaults to store=False (agent/utils.py:221), OpenAI can't resolve those IDs:

"Item with id 'rs_...' not found. Items are not persisted when store is set to false."

I think the docs may be describing the behavior when store=True — in that case reasoning items are persisted server-side and referencing them by ID works. With store=False, the items aren't persisted between API calls, so echoing the IDs back causes the error.

Here's a minimal repro script (no SDK dependency, just the raw OpenAI client) that demonstrates the issue:

# /// script
# requires-python = ">=3.11"
# dependencies = ["openai>=1.11.0"]
# ///
"""
Usage:
    export OPENAI_API_KEY=sk-...
    uv run repro_reasoning_store_false.py
"""
from openai import OpenAI

client = OpenAI()
model = "o4-mini"  # any reasoning model

# Turn 1: succeeds
resp = client.responses.create(
    model=model,
    input="What is 2+2? Be brief.",
    store=False,
)
prior_items = [item.to_dict() for item in resp.output]
print(f"[turn 1] OK — got {len(prior_items)} output items")
print("[turn 1] Reasoning item IDs being sent back:",
      [i["id"] for i in prior_items if i.get("type") == "reasoning"])

next_input = prior_items + [{"role": "user", "content": "Now what is 3+3?"}]

# Turn 2: fails — reasoning item IDs can't be resolved
try:
    client.responses.create(model=model, input=next_input, store=False)
except Exception as e:
    print(f"\n[turn 2] FAILED as expected:\n  {e}")

# Turn 2, fixed: strip reasoning items and retry
print("\n[turn 2 fixed] Stripping reasoning items and retrying ...")
stripped = [i for i in next_input if i.get("type") != "reasoning"]
client.responses.create(model=model, input=stripped, store=False)
print("[turn 2 fixed] OK")

Output:

[turn 1] OK — got 2 output items
[turn 1] Reasoning item IDs being sent back: ['rs_0c43e2006a44...']

[turn 2] FAILED as expected:
  Error code: 404 - Item with id 'rs_0c43e200...' not found.
  Items are not persisted when `store` is set to false.

[turn 2 fixed] Stripping reasoning items and retrying ...
[turn 2 fixed] OK

Turn 1 succeeds; turn 2 fails with a 404; stripping type: "reasoning" items from the input fixes it.

@enyst
Collaborator

enyst commented Feb 24, 2026

@OpenHands Read all comments in this PR. Test the script, using the openai key, and try multiple models on Responses API: o4-mini (also check if this is on Responses), gpt-5-nano, gpt-5.2, gpt-5.2-codex, or so.
Record in a .pr/ directory in a .md file each result and what you did.

Read https://developers.openai.com/api/reference/resources/responses/methods/create and investigate the docs. Note that we only care about store=False. Then, if you learn something about how you could fix the script so it works with reasoning items and with store=False, try it, preferably on gpt-5-nano (or some other version number for mini or nano, if 5 is too old)

Commit the files / test results in the .pr directory, it's a directory just for testing. Do not modify the PR otherwise. Note that it's from a fork so make sure you set the remote correctly.

@openhands-ai

openhands-ai bot commented Feb 24, 2026

I'm on it! enyst can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Feb 24, 2026

Final summary

New changes made

  • Added and committed only PR-validation artifacts under .pr/:
    • .pr/repro_responses_reasoning_store_false.py (standalone repro/exploration script)
    • .pr/repro_responses_reasoning_store_false.output.txt (captured run output)
    • .pr/responses_api_store_false_reasoning_results.md (written summary + findings)
  • No SDK/production code changes were made beyond the existing PR contents.

Verification / instructions checklist

  • Read all PR comments/review context.
  • Tested with OPENAI_API_KEY against multiple Responses API models (o4-mini, gpt-5-nano, gpt-5.2, gpt-5.2-codex) using store=False.
  • Recorded what I did + results in a .pr/*.md file.
  • Committed and pushed only .pr/ files (no other repo changes included).
  • Accounted for fork PR: pushed to the PR head repo/branch (mr-karan/software-agent-sdk:fix/strip-reasoning-items-store-false).

Conciseness check

  • The commit contains only the requested .pr/ artifacts; nothing else was added/modified.
  • Any tool-generated scratch (.agent_tmp/) remained uncommitted.

Pushed update

  • Pushed commit b0c8c107 to the PR branch fix/strip-reasoning-items-store-false (fork remote).

Co-authored-by: openhands <openhands@all-hands.dev>
@enyst
Collaborator

enyst commented Feb 24, 2026

(OpenHands-GPT-5.2 speaking)

I tested include=["reasoning.encrypted_content"] (what the docs call out for stateless / store=False) and it does change the behavior materially.

1) What worked with encrypted reasoning included

I updated the .pr/repro_responses_reasoning_store_false.py script to support:

uv run python .pr/repro_responses_reasoning_store_false.py --include-encrypted

That sets include=["reasoning.encrypted_content"] on both turns.

Results (Responses API, store=False):

  • For o4-mini, gpt-5-nano, gpt-5.2-codex:

    • Turn 1 returns a type:"reasoning" output item with an rs_... id and a non-empty encrypted_content (I recorded only lengths, e.g. 1124/1144/868 chars).
    • Turn 2 succeeds even when echoing response.output verbatim (i.e., including the rs_... reasoning item + its id). No 404.
  • For gpt-5.2, turn 1 didn’t include a reasoning output item in my run; turn 2 worked either way.

The corresponding output artifact is:

  • .pr/repro_responses_reasoning_store_false.include_encrypted.output.txt

Baseline (no include) still reproduces the 404 as before:

  • .pr/repro_responses_reasoning_store_false.output.txt

I also updated the report:

  • .pr/responses_api_store_false_reasoning_results.md

Pushed in commit: 12cd3b2d (only .pr/ artifacts).
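As a sketch, the stateless pattern tested above boils down to passing one include value on every call. The helper below only builds the request kwargs (the helper name and model name are illustrative; the actual API calls are commented out since they need an OPENAI_API_KEY):

```python
def stateless_request_kwargs(model: str, input_items) -> dict:
    """Responses API kwargs for the stateless pattern: store=False plus
    encrypted reasoning in the output, so reasoning items can be
    replayed verbatim on the next turn."""
    return {
        "model": model,
        "input": input_items,
        "store": False,
        "include": ["reasoning.encrypted_content"],
    }

# Usage sketch (requires OPENAI_API_KEY; model name is illustrative):
# from openai import OpenAI
# client = OpenAI()
# r1 = client.responses.create(**stateless_request_kwargs("o4-mini", "What is 2+2?"))
# items = [item.to_dict() for item in r1.output]  # reasoning item now carries encrypted_content
# r2 = client.responses.create(**stateless_request_kwargs(
#     "o4-mini", items + [{"role": "user", "content": "Now what is 3+3?"}]))
```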

2) Does the SDK set this already?

Yes — in the current repo state, the SDK already has support for this and it’s enabled by default:

  • LLM.enable_encrypted_reasoning defaults to True

    • openhands-sdk/openhands/sdk/llm/llm.py (around line ~291)
  • For Responses API calls, when store=False and enable_encrypted_reasoning=True, we append "reasoning.encrypted_content" to the include list:

    • openhands-sdk/openhands/sdk/llm/options/responses_options.py

So the SDK should already be requesting encrypted reasoning on stateless Responses calls, unless someone explicitly disables enable_encrypted_reasoning.

If you want, I can also add one more targeted experiment in .pr/ that mimics the SDK’s exact serialization path (i.e., taking a Message with a responses_reasoning_item and round-tripping) to confirm the encrypted_content is preserved across turns in the SDK objects.

@enyst
Collaborator

enyst commented Feb 24, 2026

HUMAN: @mr-karan I think we see what's happening: you are entirely correct, however the SDK should have asked for encrypted_reasoning, not the other id... oof why does OpenAI even send something we can't use. 😢

I think we have two alternatives:

  • ask for encrypted_reasoning, and send it back - seems better IMHO, the model will have its past reasoning in context and can continue it (OpenAI decrypts server-side, and it's the recommended way for stateless requests)
  • don't ask for encrypted, but then we also need to strip this as you suggested.

Co-authored-by: openhands <openhands@all-hands.dev>
@enyst
Collaborator

enyst commented Feb 24, 2026

(OpenHands-GPT-5.2 speaking)

1) “Real use” in the SDK (Responses API via LiteLLM)

I added a SDK-level repro that exercises the actual OpenHands call path:

  • Turn 1 via LLM.responses() (SDK parsing)
  • Turn 2 input via LLM.format_messages_for_responses() (so the prior reasoning item is replayed exactly like the SDK would)
  • Turn 2 API call via litellm.responses.main.responses(...)

Artifacts:

  • .pr/sdk_responses_stateless_replay_test.py
  • Outputs:
    • .pr/sdk_responses_stateless_replay_test.o4-mini.output.txt
    • .pr/sdk_responses_stateless_replay_test.openai_gpt-5-nano.output.txt
    • .pr/sdk_responses_stateless_replay_test.openai_gpt-5.2-codex.output.txt
    • .pr/sdk_responses_stateless_replay_test.openai_gpt-5.2.output.txt

What I observed

For reasoning models (openai/o4-mini, openai/gpt-5-nano, openai/gpt-5.2-codex):

  • With enable_encrypted_reasoning=True (SDK default):

    • Turn 1 reasoning item had non-empty encrypted_content
    • Turn 2 succeeds even when replaying the reasoning item verbatim with store=False
  • With enable_encrypted_reasoning=False:

    • Turn 1 reasoning item has encrypted_content=None
    • Turn 2 fails with:
      Item with id 'rs_...' not found. Items are not persisted when store is set to false...
  • With the current PR behavior (strip reasoning items from input when store=False):

    • Turn 2 succeeds even when encrypted reasoning is disabled

For openai/gpt-5.2, I didn’t get a type:"reasoning" output item in these runs, so the replay problem doesn’t show up.

Does the SDK set it?

Yes:

  • LLM.enable_encrypted_reasoning defaults to True
  • select_responses_options() appends "reasoning.encrypted_content" to include when store=False and that flag is enabled.

(Details + file pointers are also written up in .pr/sdk_and_hodor_investigation.md.)

2) hodor: why did he see the error even though it’s built on the SDK?

I cloned https://github.com/mr-karan/hodor and inspected its OpenHands integration.

Key finding: hodor intentionally disables encrypted reasoning:

  • hodor/hodor/llm/openhands_client.py sets:
    • llm_config["enable_encrypted_reasoning"] = False
    • and even monkeypatches the SDK options to ensure "reasoning.encrypted_content" gets removed from include.

It also contains a monkeypatch to drop responses_reasoning_item at parse-time (so it never gets replayed), with comments explicitly referencing PR #2178 as the upstream fix. So the 404 error is exactly what you get when:

  • encrypted reasoning is disabled and
  • reasoning items are still being replayed (e.g., before those patches existed / if they aren’t active in a particular execution path).

Where I recorded everything / pushed

I committed and pushed only .pr/ artifacts to the PR branch in commit:

  • 938b16ed

Main writeup:

  • .pr/sdk_and_hodor_investigation.md

If you want to adjust the PR’s fix based on this: one nuance is that unconditionally stripping reasoning items when store=False also strips the valid stateless path (when encrypted_content is present). A more targeted strategy could be: only strip reasoning items when they lack encrypted_content (or when encrypted reasoning is disabled). I didn’t change the PR code since you asked to keep changes in .pr/ only, but I can implement that if you decide that’s the desired behavior.
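The more targeted strategy mentioned above could look like this sketch (function name is illustrative):

```python
def strip_unreplayable_reasoning(input_items: list[dict]) -> list[dict]:
    """Keep reasoning items that carry encrypted_content (valid for
    stateless replay); drop the bare-ID reasoning items, which trigger
    the 404 when store=False. Sketch of the suggested targeted fix."""
    return [
        item
        for item in input_items
        if item.get("type") != "reasoning" or item.get("encrypted_content")
    ]
```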

@enyst
Collaborator

enyst commented Feb 24, 2026

HUMAN: (we kinda have to say it around here 😅 )

hodor/hodor/llm/openhands_client.py sets:

  • llm_config["enable_encrypted_reasoning"] = False
  • and even monkeypatches the SDK options to ensure "reasoning.encrypted_content" gets removed from include.

I agree it's a bug that the SDK fails if you disable encrypted_reasoning, but I'm curious why you do? It's literally the model reasoning... 🤔 This is what OpenAI doc says:

reasoning.encrypted_content: Includes an encrypted version of reasoning tokens in reasoning item outputs. This enables reasoning items to be used in multi-turn conversations when using the Responses API statelessly (like when the store parameter is set to false, or when an organization is enrolled in the zero data retention program).

Or here:

To disable statefulness, but still take advantage of reasoning:
