fix(acp): clean up pending + cancel agent on abandoned prompts by brettchien · Pull Request #760 · openabdev/openab

brettchien · 2026-05-06T14:55:06Z

Closes #732.

Problem

Originally reported on Discord: https://discord.com/channels/1491295327620169908/1491365158868619404/1500930306620920040

Prompts running longer than 600 s triggered recv_timeout in adapter.rs:386. The broker printed "Agent stopped responding" and broke out of the recv loop, but:

pending[request_id] was not removed
No session/cancel was sent — the agent kept running
When the agent eventually finished, it emitted a response with the original id. connection.rs:286 found the stale pending entry and forwarded it to the current prompt's notify_tx — causing that prompt to break immediately with empty text_buf → (no response)
The cascade repeated for every subsequent prompt until the backlog drained

Fix (A + B + C)

(A) Replace flat 600 s timeout with a tokio::select! loop:

recv arm — normal path
30 s liveness arm — checks conn.alive() + configurable hard ceiling
Long-running tools no longer trip the timeout; only a dead reader or the hard ceiling abandons the prompt

(B) AcpConnection::abandon_request(request_id) — drops pending[request_id] and best-effort sends session/cancel on every abandon path

(C) Capture request_id from session_prompt() (was discarded as _); skip notifications whose id doesn't match — defense-in-depth if any future abandon path misses abandon_request

Config

New pool.prompt_hard_timeout_secs (default 1800 s = 30 min). Acts as a safety net against runaway sessions while the 30 s liveness check covers the typical case.

Testing

238 existing tests pass (cargo test --bin openab)
cargo clippy -- -D warnings clean
Manual repro: bash -c 'sleep 700 && echo done', wait past 600 s, send follow-up → real reply (no (no response) cascade)
Manual: kill agent mid-prompt → recv loop bails within ~30 s with Agent process died, next prompt works clean

No unit test for abandon_request — no test seam without a real subprocess; covered end-to-end via real ACP backends. session/cancel shape matches existing cancel_session / reset_session in pool.rs (prod-verified).

Refs

fix: notification loop assumes ordered events, bounded prompts, and managed session lifecycle — none hold in production #76 (Assumption 2: prompts always complete)
New @mention in main channel ignored while existing session is waiting for ACP response #307 (sibling symptom)
fix(acp): close notify channel on EOF to prevent stream hang #470 (introduced the 600 s timeout)

🤖 Generated with Claude Code

shaun-agent · 2026-05-06T15:31:29Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: fix(acp): clean up pending + cancel agent on abandoned prompts
Source: fix(acp): clean up pending + cancel agent on abandoned prompts #760
Status: moved to PR-Screening
Generated at: 2026-05-06T15:31:29.327Z
Discord thread: https://discord.com/channels/1488041051187974246/1501607147782279412

Screening report

## Intent

PR #760 is trying to fix an ACP prompt lifecycle bug where long-running prompts were incorrectly treated as dead after a flat 600-second receive timeout.

The operator-visible problem is that OpenAB reports Agent stopped responding, but does not fully clean up the abandoned request. The stale pending request can later receive an old agent response and incorrectly deliver it into a newer prompt, causing (no response) cascades for subsequent prompts.

Feat

This is a bug fix for ACP request/session handling.

It changes prompt execution behavior so long-running tools are not abandoned just because they exceed 600 seconds. Instead, OpenAB uses a liveness-check loop, abandons requests explicitly when needed, removes stale pending entries, best-effort sends session/cancel, and ignores mismatched response IDs as defense in depth.

It also adds a configurable hard ceiling: pool.prompt_hard_timeout_secs, defaulting to 1800 seconds.

Who It Serves

Primary beneficiaries:

Agent runtime operators
Discord and Slack end users whose prompts depend on ACP agents
Maintainers debugging long-running or stuck ACP sessions
Reviewers responsible for reliability of prompt dispatch and session cleanup

Rewritten Prompt

Fix ACP prompt abandonment so stale responses from timed-out or dead prompts cannot corrupt later prompts.

Replace the flat 600-second receive timeout with a loop that separately handles normal responses, periodic agent liveness checks, and a configurable hard timeout. When a prompt is abandoned for timeout, dead process, dropped channel, or similar failure, remove its pending request entry and best-effort send session/cancel. Preserve the request ID returned by session_prompt() and ensure notifications are only accepted when their response ID matches the active request.

Add configuration for the hard timeout with a safe default, update example config, and verify existing tests and clippy. Add targeted tests if practical; otherwise document why subprocess-backed ACP behavior requires manual or integration coverage.

Merge Pitch

This is worth advancing because it fixes a concrete reliability failure: one abandoned ACP prompt can poison later prompts until the stale backlog clears. The fix addresses both the immediate timeout behavior and the deeper cleanup issue around pending request ownership.

Risk profile is moderate. The change touches prompt dispatch, ACP connection state, config parsing, and process/session cancellation behavior. The main reviewer concern should be whether all abandon paths consistently call the new cleanup path, and whether session/cancel is safe to send for every abandoned request.

Best-Practice Comparison

Relevant OpenClaw principles:

Gateway-owned scheduling: partially relevant. This PR improves gateway-side ownership of prompt lifecycle decisions.
Durable job persistence: not directly relevant. The bug is in live ACP request routing, not persisted jobs.
Isolated executions: relevant in spirit. Stale responses should not affect later prompts.
Explicit delivery routing: highly relevant. Matching response IDs to active request IDs prevents cross-prompt delivery.
Retry/backoff and run logs: not directly addressed. Better logs around abandon/cancel paths may be useful follow-up.

Relevant Hermes Agent principles:

Gateway daemon tick model: partially relevant. The new 30-second liveness check resembles periodic supervision.
File locking to prevent overlap: not relevant unless ACP sessions are persisted or shared across processes.
Atomic writes for persisted state: not relevant.
Fresh session per scheduled run: conceptually relevant. The fix prevents stale session output from contaminating later work, though it does not create fresh sessions per prompt.
Self-contained prompts for scheduled tasks: not relevant to this specific PR.

Overall, the PR aligns most strongly with explicit delivery routing, isolated execution behavior, and gateway-owned lifecycle supervision.

Implementation Options

Conservative option: minimal cleanup fix
Keep the existing timeout shape but ensure every timeout path removes pending[request_id], sends best-effort session/cancel, and drops mismatched response IDs. This is smaller but still leaves long-running valid tools vulnerable to the flat timeout.

Balanced option: accept this PR’s direction
Replace the flat timeout with recv/liveness/hard-timeout supervision, add abandon_request, preserve request IDs, and reject mismatched notifications. This fixes the known cascade while keeping the design local to ACP prompt handling.

Ambitious option: formal ACP request lifecycle manager
Introduce an explicit request state machine with states like pending, active, abandoned, cancel_sent, completed, and failed. Add structured run logs, metrics, integration tests with a fake ACP subprocess, and centralized cancellation semantics across prompt, reset, and session shutdown paths.

Comparison Table

Option	Speed to ship	Complexity	Reliability	Maintainability	User impact	Fit for OpenAB right now
Conservative cleanup only	High	Low	Medium	Medium	Medium	Decent, but incomplete
Balanced PR approach	Medium	Medium	High	Good	High	Best fit
Full lifecycle manager	Low	High	Very high	High if done well	High	Better as follow-up

Recommendation

Advance the balanced approach from this PR, with reviewer focus on abandonment coverage, cancellation safety, and response-ID routing.

This is the right next step because it fixes the observed production failure without forcing a larger ACP lifecycle redesign into the same merge. A good follow-up would be a separate PR adding a fake ACP backend or subprocess test harness so timeout, cancel, stale-response, and dead-agent cases can be covered automatically.

chaodu-agent · 2026-05-06T16:05:48Z

🟢 Four-monk review — no blocking issues

Verdict: Approve. The 3-layer fix (select loop + abandon_request + stale id filtering) is well-structured and addresses the root cause of #732 cleanly.

Highlights

tokio::select! replaces the flat 600s timeout — long-running tools no longer get killed; only a dead reader or the hard ceiling triggers abandon
abandon_request() removes pending[request_id] and best-effort sends session/cancel (confirmed: ACP spec defines this as a client notification, no id, no response expected)
Stale id filtering provides defense-in-depth against any future missed abandon path
None => break path is safe — reader loop already drains all pending entries on EOF

Non-blocking NITs

#	Note	Suggestion
1	`LIVENESS_CHECK_INTERVAL` hardcoded at 30s	Could be configurable for operators; fine as follow-up
2	`session/cancel` response silently dropped if agent replies	Add `trace!` log when id-bearing response has no pending entry — improves observability
3	Hard timeout format shows `0m` if < 60s	Edge case; consider `as_secs()` display or min-value guard in config

Reviewer breakdown

Reviewer	Verdict	Extra observations
超渡法師 (Claude)	🟢	Baseline check confirmed leak path; verified `tokio::select!` race is safe for UnboundedReceiver
普渡法師 (Claude)	🟢	Verified `None => break` assumption (reader drains pending on EOF); flagged format edge case
擺渡法師 (Codex)	🟢	Confirmed `session/cancel` is spec-compliant notification; GitHub checks passing
覺渡法師 (Gemini)	🟢	Praised select loop design; agreed on trace log suggestion

- pool.liveness_check_secs: hoist the recv-loop poll cadence out of a hard-coded const onto PoolConfig so deployments can tune it. Default remains 30s. - adapter: change hard-timeout error message from ({}m) to ({}s) so non-multiple-of-60 ceilings render correctly (e.g. 90s → "(90s)"). - acp/connection: emit a tracing::trace! line when an id-bearing message arrives whose pending entry was already abandoned. Behaviour is unchanged — the adapter recv loop still filters by request_id; this just makes the stale-response path observable at trace level. cargo check + cargo clippy -- -D warnings + cargo test --bin openab all clean (238 passed).

brettchien · 2026-05-06T17:17:46Z

Follow-up — applied all 3 NITs

Thanks for the review. All three NITs addressed in f323bb0.

#	NIT	Resolution
1	`LIVENESS_CHECK_INTERVAL` hardcoded at 30s	Hoisted onto `PoolConfig` as `pool.liveness_check_secs` (default 30, documented in `config.toml.example`). Wired through `AdapterRouter::new`.
2	id-bearing response has no pending entry (stale-id observability)	Added `trace!(request_id = id, "stale id-bearing message after abandon")` at the fall-through path in `acp/connection.rs`. Behaviour unchanged — the message still falls through to subscriber forwarding; the adapter recv loop's `request_id` filter remains the actual safety net. The trace just makes the stale path observable.
3	Hard timeout format shows `0m` if < 60s	Format changed from `({}m)` / `as_secs() / 60` to `({}s)` / `as_secs()`. Renders correctly for any positive ceiling (e.g. `90s` instead of `1m`).

cargo check && cargo clippy -- -D warnings && cargo test --bin openab clean (238 passed).

chaodu-agent

✅ Approve — all NITs addressed

Verified commit f323bb0 resolves the three NITs from my earlier review:

✅ liveness_check_secs now configurable via PoolConfig (was hardcoded 30s)
✅ trace!(request_id = id, "stale id-bearing message after abandon") added for observability
✅ Hard timeout format uses as_secs() — no more 0m edge case

The 3-layer fix (select loop + abandon_request + stale id filtering) is solid. cargo test + clippy clean. Ready to merge.

…bdev#732) The flat 600s recv_timeout in adapter.rs:386 fires "Agent stopped responding" without removing pending[id] or sending session/cancel. The agent keeps running the abandoned prompt and eventually emits its final response with the original id. The reader at connection.rs:284 looks up pending[id], sees the now-stale entry, and forwards the message to the *current* notify_tx subscriber — which belongs to the next prompt. The next prompt's loop sees notification.id.is_some() and breaks immediately with empty text_buf, returning "(no response)". Each new prompt sent before the agent drains its backlog inherits the previous prompt's stale id and the cascade persists. Fix follows the issue's recommended A+B+C: (A) Replace flat 600s timeout with a tokio::select! loop in stream_prompt_blocks. Recv arm + 30s liveness arm. Liveness arm checks conn.alive() (cheap, just !reader_handle.is_finished()) and a configurable hard ceiling. Default ceiling is 30 min via pool.prompt_hard_timeout_secs. Long-running tools no longer trip the timeout — only a dead reader task or the hard ceiling abandon the prompt. (B) Add AcpConnection::abandon_request(request_id) called on every abandon path: drops pending[request_id] so a late response cannot route to a future subscriber, and best-effort writes session/cancel so the agent stops working on a request the broker has given up on. (C) Capture request_id from session_prompt() (was discarded as `_`) and skip notifications whose id doesn't match. Defense-in-depth at the routing layer; complements (B)'s cleanup if any future abandon path forgets to call abandon_request. No unit test for abandon_request — the connection has no test seam without spawning a real subprocess. Behavior is exercised end-to-end via the adapter loop on real ACP backends. Refs: - openabdev#76 (Assumption 2: prompts always complete) - openabdev#307 (sibling: same family, different visible symptom) - openabdev#470 (added the 600s recv timeout this issue exposes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- pool.liveness_check_secs: hoist the recv-loop poll cadence out of a hard-coded const onto PoolConfig so deployments can tune it. Default remains 30s. - adapter: change hard-timeout error message from ({}m) to ({}s) so non-multiple-of-60 ceilings render correctly (e.g. 90s → "(90s)"). - acp/connection: emit a tracing::trace! line when an id-bearing message arrives whose pending entry was already abandoned. Behaviour is unchanged — the adapter recv loop still filters by request_id; this just makes the stale-response path observable at trace level. cargo check + cargo clippy -- -D warnings + cargo test --bin openab all clean (238 passed).

- Add ±liveness_check_secs precision note to prompt_hard_timeout_secs doc - Add JSON-RPC id field to session/cancel in abandon_request Co-authored-by: 超渡法師 <chaodu@openab.dev>

chaodu-agent · 2026-05-06T20:34:35Z

🟢 Review: fix(acp): clean up pending + cancel agent on abandoned prompts

Verdict: LGTM — The fix is correct, well-structured, and addresses the root cause comprehensively.

What this PR solves

The flat 600s recv_timeout fired "Agent stopped responding" without removing pending[request_id] or sending session/cancel. Late responses leaked into subsequent prompts causing (no response) cascades. The three-layer fix (A: liveness loop, B: abandon cleanup, C: stale ID filter) is robust.

🟢 INFO — Done well

Root cause correctly identified and all three vectors closed
tokio::select! loop is idiomatic — no spurious wakeups
abandon_request is fire-and-forget safe (errors swallowed, pending removed first)
Config defaults reasonable (30s liveness, 30min hard ceiling)
Backward-compatible — #[serde(default)] means existing configs work unchanged
Trace-level log for stale messages — good observability without spam

🟡 NIT — Non-blocking

Liveness vs hard timeout clamp: If liveness_check_secs > prompt_hard_timeout_secs, the hard timeout effectively never fires. Consider a startup warning or clamp.
Cancel response noise: abandon_request allocates a new request ID for session/cancel. The response will hit the "stale id" trace log. Harmless but a comment noting this is intentional would help future readers.
std::time::Instant vs tokio::time::Instant: prompt_start uses std::time::Instant inside async context. Fine for elapsed checks, but tokio::time::Instant is preferred for consistency (especially under tokio::time::pause() in tests).
No unit test for stale-ID filter: Acceptable given no test seam without a subprocess. A code comment noting it is covered by manual repro would help.

Baseline verification

Verified against main:

adapter.rs:386 — confirmed flat 600s recv_timeout with no cleanup
connection.rs — confirmed session_prompt() returns (rx, request_id) but caller discarded ID as _
connection.rs:319 — confirmed pending lookup forwards to current subscriber without ID validation

All three gaps are closed by this PR.

Reviewed by 超渡法師 🪬

brettchien requested a review from thepagent as a code owner May 6, 2026 14:55

github-actions Bot added closing-soon PR missing Discord Discussion URL — will auto-close in 3 days pending-screening PR awaiting automated screening and removed closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels May 6, 2026

github-actions Bot added the pending-maintainer label May 6, 2026

chaodu-agent added pending-contributor and removed pending-maintainer labels May 6, 2026

github-actions Bot added pending-maintainer and removed pending-contributor labels May 6, 2026

chaodu-agent previously approved these changes May 6, 2026

View reviewed changes

chaodu-agent removed the pending-maintainer label May 6, 2026

brettchien and others added 2 commits May 6, 2026 20:09

chaodu-agent dismissed their stale review via c19371a May 6, 2026 20:10

chaodu-agent force-pushed the fix/issue-732-recv-timeout-cleanup branch from f323bb0 to c19371a Compare May 6, 2026 20:10

fix(acp): add precision doc + id to session/cancel

0a8fc0a

- Add ±liveness_check_secs precision note to prompt_hard_timeout_secs doc - Add JSON-RPC id field to session/cancel in abandon_request Co-authored-by: 超渡法師 <chaodu@openab.dev>

github-actions Bot added the pending-maintainer label May 6, 2026

chaodu-agent added pending-contributor and removed pending-maintainer pending-screening PR awaiting automated screening pending-contributor labels May 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(acp): clean up pending + cancel agent on abandoned prompts#760

fix(acp): clean up pending + cancel agent on abandoned prompts#760
brettchien wants to merge 3 commits intoopenabdev:mainfrom
brettchien:fix/issue-732-recv-timeout-cleanup

brettchien commented May 6, 2026 •

edited

Loading

Uh oh!

shaun-agent commented May 6, 2026

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

chaodu-agent commented May 6, 2026

Uh oh!

brettchien commented May 6, 2026

Uh oh!

chaodu-agent left a comment

Uh oh!

chaodu-agent commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

brettchien commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix (A + B + C)

Config

Testing

Refs

Uh oh!

shaun-agent commented May 6, 2026

OpenAB PR Screening

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

chaodu-agent commented May 6, 2026

🟢 Four-monk review — no blocking issues

Highlights

Non-blocking NITs

Uh oh!

brettchien commented May 6, 2026

Follow-up — applied all 3 NITs

Uh oh!

chaodu-agent left a comment

Choose a reason for hiding this comment

✅ Approve — all NITs addressed

Uh oh!

chaodu-agent commented May 6, 2026

🟢 Review: fix(acp): clean up pending + cancel agent on abandoned prompts

What this PR solves

🟢 INFO — Done well

🟡 NIT — Non-blocking

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

brettchien commented May 6, 2026 •

edited

Loading