All file paths and line numbers below reference upstream tag v0.8.3-beta.2 (commit 1fd36df).
Description
After the 600s recv_timeout in src/adapter.rs:345 fires (Agent stopped responding), subsequent prompts in the same thread return _(no response)_ — the thread stays soft-bricked for as long as the user keeps prompting before the agent drains the abandoned request, and in interactive use that is effectively "stuck until the connection is reset."
This is a follow-up to #470 (which added the timeout as defense-in-depth) and a concrete instance of #76's "Assumption 2" failure mode. Sibling to #307, which describes a different visible symptom (new @mentions dropped) caused by the same root family: the broker abandons waiting on an ACP request without telling the agent or cleaning up its bookkeeping.
Root cause
The notification dispatcher in src/acp/connection.rs:284-303 routes any inbound message that has an id to the currently registered notify_tx — without checking whether that id corresponds to the request the current subscriber is waiting on. (A minimal sketch of this routing follows the walkthrough below.)
Combined with the timeout in src/adapter.rs:345-354:
Prompt 1 is sent → session_prompt registers notify_tx = tx1 and pending[id=1] = oneshot. Request id=1 goes out on stdin.
Agent runs a long tool (e.g. Bash(npm run build)) that emits no ACP notifications for more than 10 minutes. Recv loop times out → response_error = "Agent stopped responding" → prompt_done() (sets notify_tx = None).
pending[id=1] is not removed.
No session/cancel is sent to the agent. The agent keeps running prompt 1 in the background.
Prompt 2 is sent → registers notify_tx = tx2, pending[id=2]. Sends id=2.
Agent eventually finishes prompt 1 and emits the final response (id=1). Reader at connection.rs:284 sees id=1, removes pending[1], and forwards the message to whichever subscriber is currently registered — now tx2.
The adapter.rs:356 loop reading from tx2 sees notification.id.is_some() and treats it as the completion of prompt 2, breaking immediately with empty text_buf.
final_content.is_empty() and response_error == None → outputs _(no response)_.
Prompt 2's true id=2 response arrives later. If a tx3 is already registered (because the user sent prompt 3 before id=2 returned), it is forwarded there and shorts out prompt 3 the same way. The cycle persists as long as each new prompt is sent before the previous prompt's true response arrives — easy to trigger when the agent is consistently behind by one prompt.
session_prompt() already returns (rx, request_id) — but adapter.rs:290 discards it: let (mut rx, _) = conn.session_prompt(content_blocks).await?;. So the subscriber has no way to filter mismatched ids.
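For concreteness, here is a minimal paraphrase of the routing shape described above. The AcpMessage stand-in type and the pending / notify_tx names follow this report's wording and are not copied from src/acp/connection.rs:

use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

// Hypothetical stand-in for the broker's inbound ACP message type.
struct AcpMessage { id: Option<u64> /* plus method, result, error, ... */ }

// Paraphrased shape of the reader at connection.rs:284-303: any message that
// carries an id is forwarded to whichever subscriber is registered right now.
fn route_inbound(
    msg: AcpMessage,
    pending: &mut HashMap<u64, oneshot::Sender<AcpMessage>>,
    notify_tx: &Option<mpsc::UnboundedSender<AcpMessage>>,
) {
    if let Some(id) = msg.id {
        // Bookkeeping for that id is dropped here...
        pending.remove(&id);
        // ...but the message is routed purely by "current subscriber", with no
        // check that `id` equals the request this subscriber actually sent.
        if let Some(tx) = notify_tx {
            let _ = tx.send(msg);
        }
    }
}

With this shape, the late id=1 response from the abandoned prompt lands on whichever subscriber happens to be registered (tx2 in the walkthrough), even though that subscriber is waiting on id=2.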
Steps to Reproduce
In any thread, send a prompt that triggers a long-running tool with no intermediate output for more than 10 minutes. The simplest reproducer is a Bash tool running a slow build / sleep, e.g.:
run `bash -c 'sleep 700 && echo done'`
Wait for the broker to post ⚠️ Agent stopped responding.
Send any new prompt in the same thread (e.g. hi). Observe: response is _(no response)_.
Sending further prompts faster than the agent can return each true response keeps the cascade alive (each new prompt receives the previous prompt's stale id). Once the user pauses long enough for the agent to drain its backlog with notify_tx == None, the next prompt recovers — but in interactive use the thread is functionally stuck.
Expected Behavior
After a recv timeout, the agent process may still be alive but is considered abandoned by the broker. Subsequent prompts in the same thread should either:
be processed as new, independent prompts — and in the rare case the broker does abandon a prompt (process death, hard ceiling, etc.), stale responses from the abandoned prompt must be suppressed (request-id matching) and the abandoned prompt cancelled via session/cancel so the agent stops doing work the user has given up on, or
fail fast with a clear error (e.g. "session reset required") and trigger automatic session reset on the next turn.
The thread must not be silently soft-bricked into a (no response) cascade that is effectively stuck under any normal interactive pace.
Environment
openab v0.8.3-beta.2 (commit 1fd36df, latest tag at time of writing); the same code path is present back to v0.8.2-beta.3, when #470 (fix(acp): close notify channel on EOF to prevent stream hang) added the recv timeout.
claude-agent-acp 0.29.2 (Claude Code 2.1.114).
Confirmed with Claude Code; likely reproduces with any ACP backend that can run a tool exceeding the 600s recv timeout without intermediate notifications, since the bug lives in broker-side notify_tx routing, not in any backend-specific behavior.
Suggested Fix
The recommended fix is all three: (A) + (B) + (C).
(A) replaces the flat 600s recv timeout with a liveness-aware loop, eliminating the typical trigger (long tools mis-classified as agent death).
(B) makes the rare remaining "abandon" paths safe by cancelling the agent-side prompt and clearing the orphaned pending entry whenever the broker gives up.
(C) is a cheap routing-layer invariant ("subscriber only accepts its own request_id") that prevents this family of cascades from re-emerging if any future code path forgets (B).
All three are in the broker; no agent-side change required.
(A) Replace flat timeout with liveness-aware loop — src/adapter.rs:343-360
Replace the flat 600s tokio::time::timeout around rx.recv() with a tokio::select! loop that distinguishes "agent process is alive but busy" from "agent process is actually stuck/dead":
let prompt_start = Instant::now();
let hard_timeout = Duration::from_secs(30 * 60); // consider making configurable in a follow-up
let liveness_check = Duration::from_secs(30);

loop {
    tokio::select! {
        msg = rx.recv() => {
            /* handle notification (or channel-closed → break) */
        }
        _ = tokio::time::sleep(liveness_check) => {
            if !conn.alive() {
                response_error = Some("Agent process died".into());
                // apply (B) cleanup before break
                break;
            }
            if prompt_start.elapsed() > hard_timeout {
                response_error = Some("Agent exceeded hard timeout".into());
                // apply (B) cleanup before break
                break;
            }
            // alive and under ceiling → keep waiting
        }
    }
}
AcpConnection::alive() already exists at src/acp/connection.rs:539 and is cheap (just !reader_handle.is_finished()). The hard ceiling acts purely as a safety net against runaway sessions, not as the primary liveness signal. tokio::sync::mpsc::UnboundedReceiver::recv() is cancel-safe (tokio docs), so dropping the future when the sleep arm fires is safe.
This is the same shape proposed in the closed-unmerged PR #77, restricted here to just the loop-structure change. With (A), the timeout only fires when the agent really is gone — which is what Agent stopped responding should mean. Legitimate long-running tools (npm run build, large cargo test, slow migrations, sequential WebFetch) no longer trip the 10-minute timeout at all, which is what removes the trigger for the cascade described in this report.
(B) Clean up at every abandon point
Whenever the loop in (A) decides to abandon the in-flight prompt (process died, hard ceiling exceeded, or any future "give up" branch), do two things before breaking:
Call SessionPool::cancel_session() (already exists, lock-free via the stored stdin handle) so the agent stops working on the abandoned prompt.
Remove the orphaned pending[request_id] entry so any late response is dropped at the reader instead of being routed to a future subscriber.
This is what closes the cascade in the rare case that (A)'s hard ceiling does fire while the agent is still emitting late responses. Without (B), the same race described in this report can still happen at 30 min instead of 10 min.
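As a rough sketch, assuming the (A) loop has session_id, request_id, and a SessionPool handle in scope (remove_pending() is a hypothetical helper name, and the exact cancel_session() call shape is not verified here), each abandon branch would do:

// Hedged sketch of an abandon branch inside (A)'s sleep arm.
response_error = Some("Agent exceeded hard timeout".into());

// (B) step 1, agent side: cancel the abandoned prompt so the agent stops
// spending tokens on work the broker will never consume.
// SessionPool::cancel_session() is cited above as already existing; the exact
// signature used here is assumed.
pool.cancel_session(&session_id).await;

// (B) step 2, broker side: drop the orphaned pending[request_id] entry so a
// late response with this id is discarded at the reader rather than forwarded
// to the next subscriber. remove_pending() is a hypothetical helper name.
conn.remove_pending(request_id);

break;

Running both steps before every break keeps the agent and the broker in agreement that the prompt is dead, which is exactly the invariant that (C) then enforces defensively at the routing layer.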
(C) Defensive request-id matching — src/adapter.rs:290
To enforce request-scoped routing as a defense-in-depth invariant, capture the request id from session_prompt() and ignore any completion whose id doesn't match the current request:
let (mut rx, request_id) = conn.session_prompt(content_blocks).await?;
// ...
if let Some(notification_id) = notification.id {
    if notification_id != request_id {
        // Stale response from a previously-abandoned prompt; ignore.
        continue;
    }
    if let Some(ref err) = notification.error {
        response_error = Some(format_coded_error(err.code, &err.message));
    }
    break;
}
Even with (A) + (B) in place, this invariant is cheap and prevents any future code path that abandons a prompt without going through (B) from re-introducing the same routing bug. session_prompt() already returns (rx, request_id) for exactly this purpose; today's let (mut rx, _) = ... discards it.
Fallback ranking if you can't ship all three
Listed best to worst — each is strictly weaker than (A)+(B)+(C):
(A) + (B) is the principled fix without (C)'s routing-layer invariant: timeouts fire only when the agent is gone, and when they do, both sides agree the prompt is dead. Any future "abandon" path that forgets (B)'s cleanup re-opens the cascade.
(B) alone keeps the 600s flat timeout (so long tools still trip it), but at least cleans up properly when it does — eliminates the cascade without addressing the misclassification.
(A) alone removes the trigger for the typical case (long tools) but leaves the underlying race in place — the cascade still reproduces if the agent ever truly hangs past the 30-min ceiling. Not recommended without (B).
(B) + (C) without (A) is the smallest patch that stops the production symptom on today's code, at the cost of continuing to mis-classify long tools as agent death.
(C) alone is the cheapest change and silences the visible (no response) symptom, but the agent still wastes tokens running prompts the broker has abandoned, and there is no session/cancel. Acceptable only as a stopgap.