Skip to content

feat(acp): surface agent stderr and add startup health check#752

Open
apple8409 wants to merge 2 commits intoopenabdev:mainfrom
apple8409:main
Open

feat(acp): surface agent stderr and add startup health check#752
apple8409 wants to merge 2 commits intoopenabdev:mainfrom
apple8409:main

Conversation

@apple8409
Copy link
Copy Markdown

Summary

  • Pipe agent stderr instead of discarding it (/dev/null). Each line is logged at WARN level (openab::agent target) and buffered in an 8-line ring buffer per connection.
  • Richer connection-closed errors: when an agent process exits unexpectedly, the last buffered stderr lines are appended to the error message — e.g. connection closed; agent stderr: env: node: No such file or directory — making the root cause self-contained in the log without needing to cross-reference a separate error log file.
  • Startup health check (SessionPool::health_check): on boot, OpenAB spawns a temporary agent connection and runs initialize + session/new. A clear ERROR is logged if the agent cannot be reached, instead of failing silently on the first user message.

Motivation

Diagnosing agent spawn failures was difficult because:

  1. stderr was routed to /dev/null, swallowing all agent error output
  2. The only visible symptom was a generic JSON-RPC error -1: connection closed appearing only after a user message triggered a dispatch
  3. No distinction between "bot connected to Discord" and "bot can actually reach the agent"

A concrete example: on macOS, launchd services run with a stripped PATH (/usr/bin:/bin:/usr/sbin:/sbin). If the agent binary uses #!/usr/bin/env node and Node.js is installed via Homebrew (/opt/homebrew/bin), the process exits immediately with env: node: No such file or directory — previously invisible, now surfaced in the log.

Test plan

  • Verify agent health check passed appears in log on normal startup
  • Temporarily break agent path or remove node from PATH, confirm log shows agent health check FAILED: connection closed; agent stderr: env: node: No such file or directory
  • Confirm normal agent stderr (e.g. Ripgrep warnings) appears as WARN [openab::agent] [stderr] ... in bot log
  • Confirm bot still functions correctly end-to-end after these changes

🤖 Generated with Claude Code

- Pipe agent stderr instead of discarding it; each line is logged at
  WARN level (openab::agent target) and buffered in a 8-line ring buffer
- On connection close, append the buffered stderr tail to the error
  message so "connection closed; agent stderr: ..." is self-contained
- Add SessionPool::health_check() that runs initialize + session/new
  at startup and logs a clear ERROR if the agent cannot be reached,
  instead of failing silently on the first user message

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@apple8409 apple8409 requested a review from thepagent as a code owner May 6, 2026 01:59
@github-actions github-actions Bot added pending-screening PR awaiting automated screening closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels May 6, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

⚠️ This PR is missing a Discord Discussion URL in the body.

All PRs must reference a prior Discord discussion to ensure community alignment before implementation.

Please edit the PR description to include a link like:

Discord Discussion URL: https://discord.com/channels/...

This PR will be automatically closed in 3 days if the link is not added.

@shaun-agent
Copy link
Copy Markdown
Contributor

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report ## Intent

PR #752 is trying to make ACP agent startup and connection failures diagnosable from the main OpenAB logs.

The operator-visible problem is that agent process failures currently collapse into generic connection errors like JSON-RPC error -1: connection closed, while the actual cause may be hidden on stderr or discarded entirely. This especially hurts service deployments where environment differences, such as a stripped PATH, cause the agent process to exit before handling requests.

Feat

This is a feature / observability improvement for ACP agent runtime operations.

It changes behavior by:

  • piping agent stderr into OpenAB logs instead of discarding it
  • buffering recent stderr lines per connection
  • appending recent stderr to connection-closed errors
  • adding a startup health check that runs initialize and session/new against a temporary agent connection on boot

Who It Serves

Primary beneficiaries: agent runtime operators and deployers.

Secondary beneficiaries: maintainers and reviewers debugging failed OpenAB deployments, especially Discord bot deployments where the bot may connect successfully while the backing agent is broken.

Rewritten Prompt

Implement ACP agent diagnostics for startup and unexpected process exit failures.

Update the ACP connection layer so spawned agent stderr is captured, logged line-by-line at WARN under the openab::agent target, and stored in a small per-connection ring buffer. When the agent connection closes unexpectedly, include the most recent stderr lines in the returned error message.

Add a SessionPool::health_check method that creates a temporary ACP connection during application startup and verifies the agent can complete initialize and session/new. Log a clear startup error if the check fails, without requiring the first user message to reveal the problem.

Keep the implementation scoped to diagnostics. Add or document verification for successful startup, broken agent executable/path, stderr logging, and normal end-to-end agent use.

Merge Pitch

This is worth advancing because it turns a high-friction operational failure mode into a self-contained log message. The change improves deployability without changing the user-facing ACP workflow.

Risk profile is moderate-low: the main behavior change is around process stderr handling and startup probing. The likely reviewer concern is whether stderr capture can deadlock, leak output across sessions, over-log noisy tools, or cause startup failure semantics that are too aggressive for production.

Best-Practice Comparison

OpenClaw principles that apply:

  • Explicit run logs: relevant. Surfacing stderr in the main log improves operator diagnosis.
  • Isolated executions: partially relevant. Each connection maintaining its own stderr buffer matches the idea that execution state should not bleed across runs.
  • Retry/backoff and run logs: retry/backoff is not directly addressed, but this PR improves the log side needed before retry policy is useful.
  • Gateway-owned scheduling and durable job persistence: not directly relevant. This PR is not about scheduling or job durability.
  • Explicit delivery routing: not relevant.

Hermes Agent principles that apply:

  • Gateway daemon tick model: not directly relevant.
  • File locking to prevent overlap: not relevant.
  • Atomic writes for persisted state: not relevant because this PR does not persist state.
  • Fresh session per scheduled run: partially relevant. The health check uses a temporary connection/session instead of contaminating user sessions.
  • Self-contained prompts for scheduled tasks: not relevant, but the same self-contained-diagnostics principle applies: failures should carry enough context to be actionable.

Overall, this PR aligns with the observability and isolation parts of the reference systems, but it is not trying to solve scheduling, persistence, or delivery routing.

Implementation Options

Option 1: Conservative diagnostics only
Capture stderr, log it at WARN, and append the recent stderr buffer to connection-closed errors. Skip the startup health check for now.

Option 2: Balanced runtime diagnostics plus startup health check
Merge the proposed approach: stderr capture, bounded per-connection ring buffer, enriched connection errors, and a startup SessionPool::health_check that verifies initialize plus session/new.

Option 3: Ambitious agent readiness subsystem
Add structured agent readiness state, startup health checks, periodic rechecks, health metrics, admin-visible status, retry/backoff for failed startup, and possibly per-agent run logs.

Comparison Table

Option Speed to ship Complexity Reliability Maintainability User impact Fit for OpenAB right now
Conservative diagnostics only High Low Medium High Medium Good if reviewers want minimal change
Balanced diagnostics + startup health check Medium Medium High Medium-High High Best fit
Agent readiness subsystem Low High High Medium High Too large for this PR

Recommendation

Advance the balanced option, but review it tightly around process I/O safety and startup semantics.

The PR addresses a real deployment pain point with a reasonably small surface area. The right merge discussion should focus on whether stderr reading is non-blocking enough, whether the 8-line buffer is sufficient, whether noisy stderr should always be WARN, and whether a failed health check should only log or also affect process readiness.

If reviewers want to reduce risk, split follow-up work into a second PR for richer readiness behavior, metrics, or retry/backoff. This PR should stay focused on making agent failures visible and actionable.

@shaun-agent
Copy link
Copy Markdown
Contributor

took this over and pushed ().

fixed:

  • made stderr capture deterministic on connection close by waiting briefly for the stderr reader to drain before synthesizing the error
  • kept the stderr tail logic, but moved the final error construction behind that drain boundary so the fast-fail startup case is no longer racey
  • added focused regression coverage for:
    • close-message formatting in
    • surfacing a fast-fail stderr message in

found:

  • the original PR had a real race between the stdout EOF path and the detached stderr reader task, so the exact failure this PR wanted to expose could still collapse back to a plain

validation:

  • attempted
  • attempted
  • test execution is still blocked in this container because the Rust toolchain cannot link ( not found)

@shaun-agent
Copy link
Copy Markdown
Contributor

clean follow-up comment because the previous one was mangled by shell interpolation.

pushed commit: d0a3986
commit title: fix(acp): wait for stderr drain on close

fixed:

  • made stderr capture deterministic on connection close by waiting briefly for the stderr reader to drain before synthesizing the final connection-closed error
  • kept the bounded stderr tail, but moved final error construction behind that drain boundary so the fast-fail startup case is no longer racey
  • added focused regression coverage for close-message formatting in src/acp/connection.rs
  • added focused regression coverage for SessionPool::health_check surfacing a fast-fail stderr message in src/acp/pool.rs

found:

  • the original PR had a real race between the stdout EOF path and the detached stderr reader task, so the exact failure this PR wanted to expose could still collapse back to a plain connection closed error

validation:

  • ran cargo fmt scoped to src/acp/connection.rs and src/acp/pool.rs
  • attempted cargo test health_check_surfaces_fast_fail_stderr -- --nocapture
  • attempted cargo test connection_closed_message_includes_stderr_tail -- --nocapture
  • test execution is still blocked in this container because the Rust toolchain cannot link: cc not found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

closing-soon PR missing Discord Discussion URL — will auto-close in 3 days needs-rebase pending-contributor pending-screening PR awaiting automated screening

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants