Skip to content

fix(runtime): add wall-clock timeouts to JSON-RPC request and captureTurn#302

Open
haikaljeh wants to merge 1 commit intoopenai:mainfrom
haikaljeh:fix/runtime-timeouts
Open

fix(runtime): add wall-clock timeouts to JSON-RPC request and captureTurn#302
haikaljeh wants to merge 1 commit intoopenai:mainfrom
haikaljeh:fix/runtime-timeouts

Conversation

@haikaljeh
Copy link
Copy Markdown

Problem

Two unbounded await sites in the runtime can hang indefinitely if the app-server stops responding mid-flight. An external user lost ~25 minutes when codex app-server accepted a connection and never replied — there was no timeout to surface the failure, so the plugin sat blocked until they killed the process by hand.

The two sites (verified by git blame; both have been latent since the initial commit f4d65d9):

  1. AppServerClientBase.request() in plugins/codex/scripts/lib/app-server.mjs — the returned Promise only settles when a message with the matching id reaches handleLine. If the app-server never sends a response for that id, the Promise never settles.
  2. captureTurn() in plugins/codex/scripts/lib/codex.mjs — awaits state.completion, which only resolves when turn/completed (or an inferred completion) fires. If notifications stop mid-turn, the await never resolves.

Fix

Two narrow, well-contained timeouts:

  • Per-request RPC wall-clock timeout. 30 s default, override via CODEX_APP_SERVER_RPC_TIMEOUT_MS (0 disables). On timeout, the pending entry is removed and the Promise rejects with a clear message naming the method and elapsed window. resolve and reject are wrapped to clear the timer on either settle path so a late-firing timer cannot reach into a recycled id.
  • Per-turn idle (no-progress) timeout. 240 s default, override via CODEX_TURN_IDLE_TIMEOUT_MS (0 disables). Any notification arrival resets the idle timer; a successful or failed turn-completed clears it via the existing finally. On expiry the turn rejects with the thread/turn ids in the message.

Both timers are unref'd and cleared on every settle path, so healthy turns and RPCs are unaffected.

Defaults rationale

  • 30 s RPC: longer than any healthy app-server RPC observed in practice, short enough that a wedged transport gets diagnosed quickly.
  • 240 s (4 min) turn idle: long enough that a slow but progressing turn won't false-alarm; short enough that a wedged stream gets caught before users notice.

Both can be tuned per-environment via env vars — operators who run long-tail diagnose workloads can extend, and CI can shorten.

Scope

This is intentionally the minimum diff that adds the two timeouts. No transport rewrite, no changes to broker / app-server-broker / codex-companion (caller layer), no refactors of the surrounding code. README docs for the new env vars are left as a follow-up.

Tests

Two new tests in tests/runtime.test.mjs exercise the new code paths via the existing fake-codex-fixture / spawn pattern:

  • task fails fast when an app-server JSON-RPC request never receives a response — fixture behavior hang-rpc-thread-start silently drops thread/start. With CODEX_APP_SERVER_RPC_TIMEOUT_MS=200, the task subprocess exits non-zero with timed out after 200ms in well under 2 s.
  • task fails fast when the app-server stops emitting notifications mid-turn — fixture behavior hang-after-turn-start responds to turn/start but never emits turn/completed. With CODEX_TURN_IDLE_TIMEOUT_MS=200, the task subprocess exits non-zero with turn idle timeout after 200ms in well under 2 s.

Local results (macOS, Node 22):

  • Pre-change baseline: 86 tests, 82 pass, 4 pre-existing failures unrelated to this change (status shows phases, hints, and the latest finished job, status preserves adversarial review kind labels, result returns the stored output for the latest finished job by default, resolveStateDir uses a temp-backed per-workspace directory — appear to be a sandbox/timing/macOS-tmpdir-realpath issue).
  • After change: 88 tests, 84 pass, same 4 pre-existing failures. No regressions; both new tests pass.

Diagnosis source

Diagnosed by an external user via codex's own diagnose flow, then independently verified against the release/v1.0.4 cache copy (active install) and upstream/main — files are byte-identical, so the bug applies to both shipped and tip-of-tree.

Diff size

4 files changed, +153/-1. Source-only change is ~66 LOC across the two lib/ files; the rest is the test fixture behavior + two new test cases.

…Turn

Two unbounded `await` sites in the runtime could hang indefinitely if
the app-server stopped responding mid-flight. A real-world user lost
~25 minutes when `codex app-server` accepted a connection and never
replied, with no timeout to surface the failure.

The two sites:

  1. `AppServerClientBase.request()` in
     `plugins/codex/scripts/lib/app-server.mjs` — the returned Promise
     resolves only when a matching `id` is delivered to the message
     handler. If the app-server never sends a response for that id,
     the Promise never settles.

  2. `captureTurn()` in `plugins/codex/scripts/lib/codex.mjs` — awaits
     `state.completion`, which only settles when `turn/completed` (or
     an inferred completion) fires. If the app-server stops emitting
     events mid-turn, the await never resolves.

This adds two narrow timeouts:

  * Per-request RPC wall-clock timeout, 30s default, override via
    `CODEX_APP_SERVER_RPC_TIMEOUT_MS` (set to `0` to disable). On
    timeout, the pending entry is removed and the Promise rejects with
    a clear message naming the method and elapsed window. Resolve and
    reject are wrapped to clear the timer on either path so the timer
    cannot fire after a normal response and reach into a recycled id.

  * Per-turn idle (no-progress) timeout, 240s default, override via
    `CODEX_TURN_IDLE_TIMEOUT_MS` (set to `0` to disable). Any
    notification arrival resets the idle timer; a successful or
    failed turn-completed clears it via the existing `finally`. On
    expiry the turn rejects with the thread/turn ids in the message.

Healthy turns and RPCs are unaffected — the timers are unrefed and
cleared on every settle path.

Tests: two fake-codex behaviors (`hang-rpc-thread-start` and
`hang-after-turn-start`) drive the new code paths via the existing
fixture/spawn pattern. With short overrides, both reject in well under
2 seconds. No source refactors outside the two timeout sites; no
changes to broker, lifecycle, or transport. Diagnosed by an external
user via codex's own diagnose flow against `f4d65d9`.
@haikaljeh haikaljeh requested a review from a team May 7, 2026 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant