All file paths and line numbers below reference upstream tag v0.8.3-beta.2 (commit 1fd36df).
Description
After the 600s recv_timeout in src/adapter.rs:345 fires (Agent stopped responding), subsequent prompts in the same thread return _(no response)_ — the thread stays soft-bricked for as long as the user keeps prompting before the agent drains the abandoned request, and in interactive use that is effectively "stuck until the connection is reset."
This is a follow-up to #470 (which added the timeout as defense-in-depth) and a concrete instance of #76's "Assumption 2" failure mode. Sibling to #307, which describes a different visible symptom (new @mentions dropped) caused by the same root family: the broker abandons waiting on an ACP request without telling the agent or cleaning up its bookkeeping.
Root cause
The notification dispatcher in src/acp/connection.rs:284-303 routes any inbound message that has an id to the currently registered notify_tx — without checking whether that id corresponds to the request the current subscriber is waiting on. (A minimal sketch of this routing follows the walkthrough below.)
Combined with the timeout in src/adapter.rs:345-354:
Prompt 1 is sent → session_prompt registers notify_tx = tx1 and pending[id=1] = oneshot. Request id=1 goes out on stdin.
Agent runs a long tool (e.g. Bash(npm run build)) that emits no ACP notifications for more than 10 minutes. Recv loop times out → response_error = "Agent stopped responding" → prompt_done() (sets notify_tx = None).
pending[id=1] is not removed.
No session/cancel is sent to the agent. The agent keeps running prompt 1 in the background.
Prompt 2 is sent → registers notify_tx = tx2, pending[id=2]. Sends id=2.
Agent eventually finishes prompt 1 and emits the final response (id=1). Reader at connection.rs:284 sees id=1, removes pending[1], and forwards the message to whichever subscriber is currently registered — now tx2.
The adapter.rs:356 loop reading from tx2 sees notification.id.is_some() and treats it as the completion of prompt 2, breaking immediately with empty text_buf.
final_content.is_empty() and response_error == None → outputs _(no response)_.
Prompt 2's true id=2 response arrives later. If a tx3 is already registered (because the user sent prompt 3 before id=2 returned), it is forwarded there and shorts out prompt 3 the same way. The cycle persists as long as each new prompt is sent before the previous prompt's true response arrives — easy to trigger when the agent is consistently behind by one prompt.
session_prompt() already returns (rx, request_id) — but adapter.rs:290 discards it: let (mut rx, _) = conn.session_prompt(content_blocks).await?;. So the subscriber has no way to filter mismatched ids.
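For concreteness, here is a minimal paraphrase of the routing shape described above. The AcpMessage stand-in type and the pending / notify_tx names follow this report's wording and are not copied from src/acp/connection.rs:

use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

// Hypothetical stand-in for the broker's inbound ACP message type.
struct AcpMessage { id: Option<u64> /* plus method, result, error, ... */ }

// Paraphrased shape of the reader at connection.rs:284-303: any message that
// carries an id is forwarded to whichever subscriber is registered right now.
fn route_inbound(
    msg: AcpMessage,
    pending: &mut HashMap<u64, oneshot::Sender<AcpMessage>>,
    notify_tx: &Option<mpsc::UnboundedSender<AcpMessage>>,
) {
    if let Some(id) = msg.id {
        // Bookkeeping for that id is dropped here...
        pending.remove(&id);
        // ...but the message is routed purely by "current subscriber", with no
        // check that `id` equals the request this subscriber actually sent.
        if let Some(tx) = notify_tx {
            let _ = tx.send(msg);
        }
    }
}

With this shape, the late id=1 response from the abandoned prompt lands on whichever subscriber happens to be registered (tx2 in the walkthrough), even though that subscriber is waiting on id=2.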
Steps to Reproduce
In any thread, send a prompt that triggers a long-running tool with no intermediate output for more than 10 minutes. The simplest reproducer is a Bash tool running a slow build / sleep, e.g.:
run `bash -c 'sleep 700 && echo done'`
Wait for the broker to post ⚠️ Agent stopped responding.
Send any new prompt in the same thread (e.g. hi). Observe: response is _(no response)_.
Sending further prompts faster than the agent can return each true response keeps the cascade alive (each new prompt receives the previous prompt's stale id). Once the user pauses long enough for the agent to drain its backlog with notify_tx == None, the next prompt recovers — but in interactive use the thread is functionally stuck.
Expected Behavior
After a recv timeout, the agent process may still be alive but is considered abandoned by the broker. Subsequent prompts in the same thread should either:
be processed as new, independent prompts — and in the rare case the broker does abandon a prompt (process death, hard ceiling, etc.), stale responses from the abandoned prompt must be suppressed (request-id matching) and the abandoned prompt cancelled via session/cancel so the agent stops doing work the user has given up on, or
fail fast with a clear error (e.g. "session reset required") and trigger automatic session reset on the next turn.
The thread must not be silently soft-bricked into a (no response) cascade that is effectively stuck under any normal interactive pace.
Environment
openab v0.8.3-beta.2 (commit 1fd36df, latest tag at time of writing); the same code path is present back to v0.8.2-beta.3, when #470 (fix(acp): close notify channel on EOF to prevent stream hang) added the recv timeout.
claude-agent-acp 0.29.2 (Claude Code 2.1.114).
Confirmed with Claude Code; likely reproduces with any ACP backend that can run a tool exceeding the 600s recv timeout without intermediate notifications, since the bug lives in broker-side notify_tx routing, not in any backend-specific behavior.
Suggested Fix
The recommended fix is all three: (A) + (B) + (C).
(A) replaces the flat 600s recv timeout with a liveness-aware loop, eliminating the typical trigger (long tools mis-classified as agent death).
(B) makes the rare remaining "abandon" paths safe by cancelling the agent-side prompt and clearing the orphaned pending entry whenever the broker gives up.
(C) is a cheap routing-layer invariant ("subscriber only accepts its own request_id") that prevents this family of cascades from re-emerging if any future code path forgets (B).
All three are in the broker; no agent-side change required.
(A) Replace flat timeout with liveness-aware loop — src/adapter.rs:343-360
Replace the flat 600s tokio::time::timeout around rx.recv() with a tokio::select! loop that distinguishes "agent process is alive but busy" from "agent process is actually stuck/dead":
let prompt_start = Instant::now();
let hard_timeout = Duration::from_secs(30 * 60); // consider making configurable in a follow-up
let liveness_check = Duration::from_secs(30);

loop {
    tokio::select! {
        msg = rx.recv() => {
            /* handle notification (or channel-closed → break) */
        }
        _ = tokio::time::sleep(liveness_check) => {
            if !conn.alive() {
                response_error = Some("Agent process died".into());
                // apply (B) cleanup before break
                break;
            }
            if prompt_start.elapsed() > hard_timeout {
                response_error = Some("Agent exceeded hard timeout".into());
                // apply (B) cleanup before break
                break;
            }
            // alive and under ceiling → keep waiting
        }
    }
}
AcpConnection::alive() already exists at src/acp/connection.rs:539 and is cheap (just !reader_handle.is_finished()). The hard ceiling acts purely as a safety net against runaway sessions, not as the primary liveness signal. tokio::sync::mpsc::UnboundedReceiver::recv() is cancel-safe (tokio docs), so dropping the future when the sleep arm fires is safe.
This is the same shape proposed in the closed-unmerged PR #77, restricted here to just the loop-structure change. With (A), the timeout only fires when the agent really is gone — which is what Agent stopped responding should mean. Legitimate long-running tools (npm run build, large cargo test, slow migrations, sequential WebFetch) no longer trip the 10-minute timeout at all, which is what removes the trigger for the cascade described in this report.
(B) Clean up at every abandon point
Whenever the loop in (A) decides to abandon the in-flight prompt (process died, hard ceiling exceeded, or any future "give up" branch), do two things before breaking:
Call SessionPool::cancel_session() (already exists, lock-free via the stored stdin handle) so the agent stops working on the abandoned prompt.
Remove the orphaned pending[request_id] entry so any late response is dropped at the reader instead of being routed to a future subscriber.
This is what closes the cascade in the rare case that (A)'s hard ceiling does fire while the agent is still emitting late responses. Without (B), the same race described in this report can still happen at 30 min instead of 10 min.
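As a rough sketch, assuming the (A) loop has session_id, request_id, and a SessionPool handle in scope (remove_pending() is a hypothetical helper name, and the exact cancel_session() call shape is not verified here), each abandon branch would do:

// Hedged sketch of an abandon branch inside (A)'s sleep arm.
response_error = Some("Agent exceeded hard timeout".into());

// (B) step 1, agent side: cancel the abandoned prompt so the agent stops
// spending tokens on work the broker will never consume.
// SessionPool::cancel_session() is cited above as already existing; the exact
// signature used here is assumed.
pool.cancel_session(&session_id).await;

// (B) step 2, broker side: drop the orphaned pending[request_id] entry so a
// late response with this id is discarded at the reader rather than forwarded
// to the next subscriber. remove_pending() is a hypothetical helper name.
conn.remove_pending(request_id);

break;

Running both steps before every break keeps the agent and the broker in agreement that the prompt is dead, which is exactly the invariant that (C) then enforces defensively at the routing layer.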
(C) Defensive request-id matching — src/adapter.rs:290
To enforce request-scoped routing as a defense-in-depth invariant, capture the request id from session_prompt() and ignore any completion whose id doesn't match the current request:
let (mut rx, request_id) = conn.session_prompt(content_blocks).await?;
// ...
if let Some(notification_id) = notification.id {
    if notification_id != request_id {
        // Stale response from a previously-abandoned prompt; ignore.
        continue;
    }
    if let Some(ref err) = notification.error {
        response_error = Some(format_coded_error(err.code, &err.message));
    }
    break;
}
Even with (A) + (B) in place, this invariant is cheap and prevents any future code path that abandons a prompt without going through (B) from re-introducing the same routing bug. session_prompt() already returns (rx, request_id) for exactly this purpose; today's let (mut rx, _) = ... discards it.
Fallback ranking if you can't ship all three
Listed best to worst — each is strictly weaker than (A)+(B)+(C):
(A) + (B) is the principled fix without (C)'s routing-layer invariant: timeouts fire only when the agent is gone, and when they do, both sides agree the prompt is dead. Any future "abandon" path that forgets (B)'s cleanup re-opens the cascade.
(B) alone keeps the 600s flat timeout (so long tools still trip it), but at least cleans up properly when it does — eliminates the cascade without addressing the misclassification.
(A) alone removes the trigger for the typical case (long tools) but leaves the underlying race in place — the cascade still reproduces if the agent ever truly hangs past the 30-min ceiling. Not recommended without (B).
(B) + (C) without (A) is the smallest patch that stops the production symptom on today's code, at the cost of continuing to mis-classify long tools as agent death.
(C) alone is the cheapest change and silences the visible (no response) symptom, but the agent still wastes tokens running prompts the broker has abandoned, and there is no session/cancel. Acceptable only as a stopgap.