fix(acp): clean up pending + cancel agent on abandoned prompts#760
fix(acp): clean up pending + cancel agent on abandoned prompts#760brettchien wants to merge 3 commits intoopenabdev:mainfrom
Conversation
OpenAB PR ScreeningThis is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening report## IntentPR #760 is trying to fix an ACP prompt lifecycle bug where long-running prompts were incorrectly treated as dead after a flat 600-second receive timeout. The operator-visible problem is that OpenAB reports FeatThis is a bug fix for ACP request/session handling. It changes prompt execution behavior so long-running tools are not abandoned just because they exceed 600 seconds. Instead, OpenAB uses a liveness-check loop, abandons requests explicitly when needed, removes stale pending entries, best-effort sends It also adds a configurable hard ceiling: Who It ServesPrimary beneficiaries:
Rewritten PromptFix ACP prompt abandonment so stale responses from timed-out or dead prompts cannot corrupt later prompts. Replace the flat 600-second receive timeout with a loop that separately handles normal responses, periodic agent liveness checks, and a configurable hard timeout. When a prompt is abandoned for timeout, dead process, dropped channel, or similar failure, remove its pending request entry and best-effort send Add configuration for the hard timeout with a safe default, update example config, and verify existing tests and clippy. Add targeted tests if practical; otherwise document why subprocess-backed ACP behavior requires manual or integration coverage. Merge PitchThis is worth advancing because it fixes a concrete reliability failure: one abandoned ACP prompt can poison later prompts until the stale backlog clears. The fix addresses both the immediate timeout behavior and the deeper cleanup issue around pending request ownership. Risk profile is moderate. The change touches prompt dispatch, ACP connection state, config parsing, and process/session cancellation behavior. The main reviewer concern should be whether all abandon paths consistently call the new cleanup path, and whether Best-Practice ComparisonRelevant OpenClaw principles:
Relevant Hermes Agent principles:
Overall, the PR aligns most strongly with explicit delivery routing, isolated execution behavior, and gateway-owned lifecycle supervision. Implementation OptionsConservative option: minimal cleanup fix Balanced option: accept this PR’s direction Ambitious option: formal ACP request lifecycle manager Comparison Table
RecommendationAdvance the balanced approach from this PR, with reviewer focus on abandonment coverage, cancellation safety, and response-ID routing. This is the right next step because it fixes the observed production failure without forcing a larger ACP lifecycle redesign into the same merge. A good follow-up would be a separate PR adding a fake ACP backend or subprocess test harness so timeout, cancel, stale-response, and dead-agent cases can be covered automatically. |
🟢 Four-monk review — no blocking issuesVerdict: Approve. The 3-layer fix (select loop + abandon_request + stale id filtering) is well-structured and addresses the root cause of #732 cleanly. Highlights
Non-blocking NITs
Reviewer breakdown
|
- pool.liveness_check_secs: hoist the recv-loop poll cadence out of a
hard-coded const onto PoolConfig so deployments can tune it. Default
remains 30s.
- adapter: change hard-timeout error message from ({}m) to ({}s) so
non-multiple-of-60 ceilings render correctly (e.g. 90s → "(90s)").
- acp/connection: emit a tracing::trace! line when an id-bearing message
arrives whose pending entry was already abandoned. Behaviour is
unchanged — the adapter recv loop still filters by request_id; this
just makes the stale-response path observable at trace level.
cargo check + cargo clippy -- -D warnings + cargo test --bin openab all
clean (238 passed).
Follow-up — applied all 3 NITsThanks for the review. All three NITs addressed in
|
chaodu-agent
left a comment
There was a problem hiding this comment.
✅ Approve — all NITs addressed
Verified commit f323bb0 resolves the three NITs from my earlier review:
- ✅
liveness_check_secsnow configurable viaPoolConfig(was hardcoded 30s) - ✅
trace!(request_id = id, "stale id-bearing message after abandon")added for observability - ✅ Hard timeout format uses
as_secs()— no more0medge case
The 3-layer fix (select loop + abandon_request + stale id filtering) is solid. cargo test + clippy clean. Ready to merge.
…bdev#732) The flat 600s recv_timeout in adapter.rs:386 fires "Agent stopped responding" without removing pending[id] or sending session/cancel. The agent keeps running the abandoned prompt and eventually emits its final response with the original id. The reader at connection.rs:284 looks up pending[id], sees the now-stale entry, and forwards the message to the *current* notify_tx subscriber — which belongs to the next prompt. The next prompt's loop sees notification.id.is_some() and breaks immediately with empty text_buf, returning "(no response)". Each new prompt sent before the agent drains its backlog inherits the previous prompt's stale id and the cascade persists. Fix follows the issue's recommended A+B+C: (A) Replace flat 600s timeout with a tokio::select! loop in stream_prompt_blocks. Recv arm + 30s liveness arm. Liveness arm checks conn.alive() (cheap, just !reader_handle.is_finished()) and a configurable hard ceiling. Default ceiling is 30 min via pool.prompt_hard_timeout_secs. Long-running tools no longer trip the timeout — only a dead reader task or the hard ceiling abandon the prompt. (B) Add AcpConnection::abandon_request(request_id) called on every abandon path: drops pending[request_id] so a late response cannot route to a future subscriber, and best-effort writes session/cancel so the agent stops working on a request the broker has given up on. (C) Capture request_id from session_prompt() (was discarded as `_`) and skip notifications whose id doesn't match. Defense-in-depth at the routing layer; complements (B)'s cleanup if any future abandon path forgets to call abandon_request. No unit test for abandon_request — the connection has no test seam without spawning a real subprocess. Behavior is exercised end-to-end via the adapter loop on real ACP backends. Refs: - openabdev#76 (Assumption 2: prompts always complete) - openabdev#307 (sibling: same family, different visible symptom) - openabdev#470 (added the 600s recv timeout this issue exposes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- pool.liveness_check_secs: hoist the recv-loop poll cadence out of a
hard-coded const onto PoolConfig so deployments can tune it. Default
remains 30s.
- adapter: change hard-timeout error message from ({}m) to ({}s) so
non-multiple-of-60 ceilings render correctly (e.g. 90s → "(90s)").
- acp/connection: emit a tracing::trace! line when an id-bearing message
arrives whose pending entry was already abandoned. Behaviour is
unchanged — the adapter recv loop still filters by request_id; this
just makes the stale-response path observable at trace level.
cargo check + cargo clippy -- -D warnings + cargo test --bin openab all
clean (238 passed).
f323bb0 to
c19371a
Compare
- Add ±liveness_check_secs precision note to prompt_hard_timeout_secs doc - Add JSON-RPC id field to session/cancel in abandon_request Co-authored-by: 超渡法師 <chaodu@openab.dev>
🟢 Review: fix(acp): clean up pending + cancel agent on abandoned promptsVerdict: LGTM — The fix is correct, well-structured, and addresses the root cause comprehensively. What this PR solvesThe flat 600s 🟢 INFO — Done well
🟡 NIT — Non-blocking
Baseline verificationVerified against
All three gaps are closed by this PR. Reviewed by 超渡法師 🪬 |
Closes #732.
Problem
Originally reported on Discord: https://discord.com/channels/1491295327620169908/1491365158868619404/1500930306620920040
Prompts running longer than 600 s triggered
recv_timeoutinadapter.rs:386. The broker printed "Agent stopped responding" and broke out of the recv loop, but:pending[request_id]was not removedsession/cancelwas sent — the agent kept runningid.connection.rs:286found the stalependingentry and forwarded it to the current prompt'snotify_tx— causing that prompt to break immediately with emptytext_buf→(no response)Fix (A + B + C)
(A) Replace flat 600 s timeout with a
tokio::select!loop:conn.alive()+ configurable hard ceiling(B)
AcpConnection::abandon_request(request_id)— dropspending[request_id]and best-effort sendssession/cancelon every abandon path(C) Capture
request_idfromsession_prompt()(was discarded as_); skip notifications whoseiddoesn't match — defense-in-depth if any future abandon path missesabandon_requestConfig
New
pool.prompt_hard_timeout_secs(default1800s = 30 min). Acts as a safety net against runaway sessions while the 30 s liveness check covers the typical case.Testing
cargo test --bin openab)cargo clippy -- -D warningscleanbash -c 'sleep 700 && echo done', wait past 600 s, send follow-up → real reply (no(no response)cascade)Agent process died, next prompt works cleanNo unit test for
abandon_request— no test seam without a real subprocess; covered end-to-end via real ACP backends.session/cancelshape matches existingcancel_session/reset_sessioninpool.rs(prod-verified).Refs
🤖 Generated with Claude Code