fix(runtime): add wall-clock timeouts to JSON-RPC request and captureTurn#302
Open
haikaljeh wants to merge 1 commit intoopenai:mainfrom
Open
fix(runtime): add wall-clock timeouts to JSON-RPC request and captureTurn#302haikaljeh wants to merge 1 commit intoopenai:mainfrom
haikaljeh wants to merge 1 commit intoopenai:mainfrom
Conversation
…Turn
Two unbounded `await` sites in the runtime could hang indefinitely if
the app-server stopped responding mid-flight. A real-world user lost
~25 minutes when `codex app-server` accepted a connection and never
replied, with no timeout to surface the failure.
The two sites:
1. `AppServerClientBase.request()` in
`plugins/codex/scripts/lib/app-server.mjs` — the returned Promise
resolves only when a matching `id` is delivered to the message
handler. If the app-server never sends a response for that id,
the Promise never settles.
2. `captureTurn()` in `plugins/codex/scripts/lib/codex.mjs` — awaits
`state.completion`, which only settles when `turn/completed` (or
an inferred completion) fires. If the app-server stops emitting
events mid-turn, the await never resolves.
This adds two narrow timeouts:
* Per-request RPC wall-clock timeout, 30s default, override via
`CODEX_APP_SERVER_RPC_TIMEOUT_MS` (set to `0` to disable). On
timeout, the pending entry is removed and the Promise rejects with
a clear message naming the method and elapsed window. Resolve and
reject are wrapped to clear the timer on either path so the timer
cannot fire after a normal response and reach into a recycled id.
* Per-turn idle (no-progress) timeout, 240s default, override via
`CODEX_TURN_IDLE_TIMEOUT_MS` (set to `0` to disable). Any
notification arrival resets the idle timer; a successful or
failed turn-completed clears it via the existing `finally`. On
expiry the turn rejects with the thread/turn ids in the message.
Healthy turns and RPCs are unaffected — the timers are unrefed and
cleared on every settle path.
Tests: two fake-codex behaviors (`hang-rpc-thread-start` and
`hang-after-turn-start`) drive the new code paths via the existing
fixture/spawn pattern. With short overrides, both reject in well under
2 seconds. No source refactors outside the two timeout sites; no
changes to broker, lifecycle, or transport. Diagnosed by an external
user via codex's own diagnose flow against `f4d65d9`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Two unbounded
awaitsites in the runtime can hang indefinitely if the app-server stops responding mid-flight. An external user lost ~25 minutes whencodex app-serveraccepted a connection and never replied — there was no timeout to surface the failure, so the plugin sat blocked until they killed the process by hand.The two sites (verified by
git blame; both have been latent since the initial commitf4d65d9):AppServerClientBase.request()inplugins/codex/scripts/lib/app-server.mjs— the returned Promise only settles when a message with the matchingidreacheshandleLine. If the app-server never sends a response for that id, the Promise never settles.captureTurn()inplugins/codex/scripts/lib/codex.mjs— awaitsstate.completion, which only resolves whenturn/completed(or an inferred completion) fires. If notifications stop mid-turn, the await never resolves.Fix
Two narrow, well-contained timeouts:
CODEX_APP_SERVER_RPC_TIMEOUT_MS(0disables). On timeout, the pending entry is removed and the Promise rejects with a clear message naming the method and elapsed window.resolveandrejectare wrapped to clear the timer on either settle path so a late-firing timer cannot reach into a recycled id.CODEX_TURN_IDLE_TIMEOUT_MS(0disables). Any notification arrival resets the idle timer; a successful or failed turn-completed clears it via the existingfinally. On expiry the turn rejects with the thread/turn ids in the message.Both timers are
unref'd and cleared on every settle path, so healthy turns and RPCs are unaffected.Defaults rationale
Both can be tuned per-environment via env vars — operators who run long-tail diagnose workloads can extend, and CI can shorten.
Scope
This is intentionally the minimum diff that adds the two timeouts. No transport rewrite, no changes to broker / app-server-broker / codex-companion (caller layer), no refactors of the surrounding code. README docs for the new env vars are left as a follow-up.
Tests
Two new tests in
tests/runtime.test.mjsexercise the new code paths via the existingfake-codex-fixture/spawnpattern:task fails fast when an app-server JSON-RPC request never receives a response— fixture behaviorhang-rpc-thread-startsilently dropsthread/start. WithCODEX_APP_SERVER_RPC_TIMEOUT_MS=200, the task subprocess exits non-zero withtimed out after 200msin well under 2 s.task fails fast when the app-server stops emitting notifications mid-turn— fixture behaviorhang-after-turn-startresponds toturn/startbut never emitsturn/completed. WithCODEX_TURN_IDLE_TIMEOUT_MS=200, the task subprocess exits non-zero withturn idle timeout after 200msin well under 2 s.Local results (macOS, Node 22):
status shows phases, hints, and the latest finished job,status preserves adversarial review kind labels,result returns the stored output for the latest finished job by default,resolveStateDir uses a temp-backed per-workspace directory— appear to be a sandbox/timing/macOS-tmpdir-realpath issue).Diagnosis source
Diagnosed by an external user via codex's own diagnose flow, then independently verified against the
release/v1.0.4cache copy (active install) andupstream/main— files are byte-identical, so the bug applies to both shipped and tip-of-tree.Diff size
4 files changed, +153/-1. Source-only change is ~66 LOC across the two
lib/files; the rest is the test fixture behavior + two new test cases.