Skip to content

Cold-start MCP-init timeout aborts agent run on fresh install #335

@harshalmore31

Description

@harshalmore31

Failure

Right after a fresh uv sync, the very first openai-agent run aborted with:

mcp.shared.exceptions.McpError: Timed out while waiting for response to ClientRequest. Waited 5.0 seconds.

The traceback ends in agents/mcp/server.py:726 → connect → session.initialize. One of the six MCP servers spawned via uv run <name>-mcp-server exceeded the 5s handshake window during cold start, and the whole run died.

Only reproducible on the genuinely first invocation on a cold machine — warm runs are fine. After uv sync, the OS page cache is empty and the first uv run <server> pays both uv run setup cost and a full import cost for that server's heavy modules. I measured a one-time cold start of ~12s for the vibration server in isolation (scipy + numpy + DSP submodules), down to 0.89s once warm. Warm handshake latencies across all six servers were 0.26–0.89s.

Root cause

The runners never set the MCP client-session timeout, so the SDK default of 5s bites on cold start:

  • src/agent/openai_agent/runner.py:91MCPServerStdio(...) constructed without client_session_timeout_seconds. SDK default is 5 (openai-agents agents/mcp/server.py:1055, 1176, 1311).
  • src/agent/deep_agent/runner.py:199MultiServerMCPClient(connections) constructed without session_kwargs, so ClientSession(read_timeout_seconds=...) falls back to the library default.
  • src/agent/claude_agent/runner.py:65-72McpStdioServerConfig dicts have no timeout knob in the TypedDict surface; behaviour is internal to claude-agent-sdk and didn't crash in my testing, but the same pattern of leaving SDK defaults unset applies.

This is the worst possible moment for a failure — the user's first run on a fresh clone. Cold-cache flakiness on a one-line knob isn't a real bug story.

Proposed solution

Plumb a generous client-session timeout through all three SDK runners. ~30s is fine for cold-start; warm cases are well under 1s so there's no operational downside.

Concretely:

  • openai-agent: pass client_session_timeout_seconds=30 to MCPServerStdio(...).
  • deep-agent: pass session_kwargs={"read_timeout_seconds": timedelta(seconds=30)} per connection in _build_mcp_connections.
  • claude-agent: best handled via whatever McpStdioServerConfig extension or session-level config the SDK exposes — needs a quick look at the SDK surface.

Optionally expose the value as an env var (MCP_INIT_TIMEOUT_SECONDS) or a --mcp-timeout CLI flag for users on slow machines.

Happy to send a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions