feat(acp): surface agent stderr and add startup health check by apple8409 · Pull Request #752 · openabdev/openab

apple8409 · 2026-05-06T01:59:39Z

Summary

Pipe agent stderr instead of discarding it (/dev/null). Each line is logged at WARN level (openab::agent target) and buffered in an 8-line ring buffer per connection.
Richer connection-closed errors: when an agent process exits unexpectedly, the last buffered stderr lines are appended to the error message — e.g. connection closed; agent stderr: env: node: No such file or directory — making the root cause self-contained in the log without needing to cross-reference a separate error log file.
Startup health check (SessionPool::health_check): on boot, OpenAB spawns a temporary agent connection and runs initialize + session/new. A clear ERROR is logged if the agent cannot be reached, instead of failing silently on the first user message.

Motivation

Diagnosing agent spawn failures was difficult because:

stderr was routed to /dev/null, swallowing all agent error output
The only visible symptom was a generic JSON-RPC error -1: connection closed appearing only after a user message triggered a dispatch
No distinction between "bot connected to Discord" and "bot can actually reach the agent"

A concrete example: on macOS, launchd services run with a stripped PATH (/usr/bin:/bin:/usr/sbin:/sbin). If the agent binary uses #!/usr/bin/env node and Node.js is installed via Homebrew (/opt/homebrew/bin), the process exits immediately with env: node: No such file or directory — previously invisible, now surfaced in the log.

Test plan

Verify agent health check passed appears in log on normal startup
Temporarily break agent path or remove node from PATH, confirm log shows agent health check FAILED: connection closed; agent stderr: env: node: No such file or directory
Confirm normal agent stderr (e.g. Ripgrep warnings) appears as WARN [openab::agent] [stderr] ... in bot log
Confirm bot still functions correctly end-to-end after these changes

🤖 Generated with Claude Code

- Pipe agent stderr instead of discarding it; each line is logged at WARN level (openab::agent target) and buffered in a 8-line ring buffer - On connection close, append the buffered stderr tail to the error message so "connection closed; agent stderr: ..." is self-contained - Add SessionPool::health_check() that runs initialize + session/new at startup and logs a clear ERROR if the agent cannot be reached, instead of failing silently on the first user message Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-06T01:59:57Z

⚠️ This PR is missing a Discord Discussion URL in the body.

All PRs must reference a prior Discord discussion to ensure community alignment before implementation.

Please edit the PR description to include a link like:

Discord Discussion URL: https://discord.com/channels/...

This PR will be automatically closed in 3 days if the link is not added.

shaun-agent · 2026-05-06T02:01:32Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: feat(acp): surface agent stderr and add startup health check
Source: feat(acp): surface agent stderr and add startup health check #752
Status: moved to PR-Screening
Generated at: 2026-05-06T02:01:32.178Z
Discord thread: https://discord.com/channels/1488041051187974246/1501403305061126376

Screening report

## Intent

PR #752 is trying to make ACP agent startup and connection failures diagnosable from the main OpenAB logs.

The operator-visible problem is that agent process failures currently collapse into generic connection errors like JSON-RPC error -1: connection closed, while the actual cause may be hidden on stderr or discarded entirely. This especially hurts service deployments where environment differences, such as a stripped PATH, cause the agent process to exit before handling requests.

Feat

This is a feature / observability improvement for ACP agent runtime operations.

It changes behavior by:

piping agent stderr into OpenAB logs instead of discarding it
buffering recent stderr lines per connection
appending recent stderr to connection-closed errors
adding a startup health check that runs initialize and session/new against a temporary agent connection on boot

Who It Serves

Primary beneficiaries: agent runtime operators and deployers.

Secondary beneficiaries: maintainers and reviewers debugging failed OpenAB deployments, especially Discord bot deployments where the bot may connect successfully while the backing agent is broken.

Rewritten Prompt

Implement ACP agent diagnostics for startup and unexpected process exit failures.

Update the ACP connection layer so spawned agent stderr is captured, logged line-by-line at WARN under the openab::agent target, and stored in a small per-connection ring buffer. When the agent connection closes unexpectedly, include the most recent stderr lines in the returned error message.

Add a SessionPool::health_check method that creates a temporary ACP connection during application startup and verifies the agent can complete initialize and session/new. Log a clear startup error if the check fails, without requiring the first user message to reveal the problem.

Keep the implementation scoped to diagnostics. Add or document verification for successful startup, broken agent executable/path, stderr logging, and normal end-to-end agent use.

Merge Pitch

This is worth advancing because it turns a high-friction operational failure mode into a self-contained log message. The change improves deployability without changing the user-facing ACP workflow.

Risk profile is moderate-low: the main behavior change is around process stderr handling and startup probing. The likely reviewer concern is whether stderr capture can deadlock, leak output across sessions, over-log noisy tools, or cause startup failure semantics that are too aggressive for production.

Best-Practice Comparison

OpenClaw principles that apply:

Explicit run logs: relevant. Surfacing stderr in the main log improves operator diagnosis.
Isolated executions: partially relevant. Each connection maintaining its own stderr buffer matches the idea that execution state should not bleed across runs.
Retry/backoff and run logs: retry/backoff is not directly addressed, but this PR improves the log side needed before retry policy is useful.
Gateway-owned scheduling and durable job persistence: not directly relevant. This PR is not about scheduling or job durability.
Explicit delivery routing: not relevant.

Hermes Agent principles that apply:

Gateway daemon tick model: not directly relevant.
File locking to prevent overlap: not relevant.
Atomic writes for persisted state: not relevant because this PR does not persist state.
Fresh session per scheduled run: partially relevant. The health check uses a temporary connection/session instead of contaminating user sessions.
Self-contained prompts for scheduled tasks: not relevant, but the same self-contained-diagnostics principle applies: failures should carry enough context to be actionable.

Overall, this PR aligns with the observability and isolation parts of the reference systems, but it is not trying to solve scheduling, persistence, or delivery routing.

Implementation Options

Option 1: Conservative diagnostics only
Capture stderr, log it at WARN, and append the recent stderr buffer to connection-closed errors. Skip the startup health check for now.

Option 2: Balanced runtime diagnostics plus startup health check
Merge the proposed approach: stderr capture, bounded per-connection ring buffer, enriched connection errors, and a startup SessionPool::health_check that verifies initialize plus session/new.

Option 3: Ambitious agent readiness subsystem
Add structured agent readiness state, startup health checks, periodic rechecks, health metrics, admin-visible status, retry/backoff for failed startup, and possibly per-agent run logs.

Comparison Table

Option	Speed to ship	Complexity	Reliability	Maintainability	User impact	Fit for OpenAB right now
Conservative diagnostics only	High	Low	Medium	High	Medium	Good if reviewers want minimal change
Balanced diagnostics + startup health check	Medium	Medium	High	Medium-High	High	Best fit
Agent readiness subsystem	Low	High	High	Medium	High	Too large for this PR

Recommendation

Advance the balanced option, but review it tightly around process I/O safety and startup semantics.

The PR addresses a real deployment pain point with a reasonably small surface area. The right merge discussion should focus on whether stderr reading is non-blocking enough, whether the 8-line buffer is sufficient, whether noisy stderr should always be WARN, and whether a failed health check should only log or also affect process readiness.

If reviewers want to reduce risk, split follow-up work into a second PR for richer readiness behavior, metrics, or retry/backoff. This PR should stay focused on making agent failures visible and actionable.

shaun-agent · 2026-05-06T13:12:53Z

took this over and pushed ().

fixed:

made stderr capture deterministic on connection close by waiting briefly for the stderr reader to drain before synthesizing the error
kept the stderr tail logic, but moved the final error construction behind that drain boundary so the fast-fail startup case is no longer racey
added focused regression coverage for:
- close-message formatting in
- surfacing a fast-fail stderr message in

found:

the original PR had a real race between the stdout EOF path and the detached stderr reader task, so the exact failure this PR wanted to expose could still collapse back to a plain

validation:

attempted
attempted
test execution is still blocked in this container because the Rust toolchain cannot link ( not found)

shaun-agent · 2026-05-06T13:13:25Z

clean follow-up comment because the previous one was mangled by shell interpolation.

pushed commit: d0a3986
commit title: fix(acp): wait for stderr drain on close

fixed:

made stderr capture deterministic on connection close by waiting briefly for the stderr reader to drain before synthesizing the final connection-closed error
kept the bounded stderr tail, but moved final error construction behind that drain boundary so the fast-fail startup case is no longer racey
added focused regression coverage for close-message formatting in src/acp/connection.rs
added focused regression coverage for SessionPool::health_check surfacing a fast-fail stderr message in src/acp/pool.rs

found:

the original PR had a real race between the stdout EOF path and the detached stderr reader task, so the exact failure this PR wanted to expose could still collapse back to a plain connection closed error

validation:

ran cargo fmt scoped to src/acp/connection.rs and src/acp/pool.rs
attempted cargo test health_check_surfaces_fast_fail_stderr -- --nocapture
attempted cargo test connection_closed_message_includes_stderr_tail -- --nocapture
test execution is still blocked in this container because the Rust toolchain cannot link: cc not found

apple8409 requested a review from thepagent as a code owner May 6, 2026 01:59

github-actions Bot added pending-screening PR awaiting automated screening closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels May 6, 2026

fix(acp): wait for stderr drain on close

d0a3986

github-actions Bot added needs-rebase pending-contributor labels May 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(acp): surface agent stderr and add startup health check#752

feat(acp): surface agent stderr and add startup health check#752
apple8409 wants to merge 2 commits intoopenabdev:mainfrom
apple8409:main

apple8409 commented May 6, 2026

Uh oh!

github-actions Bot commented May 6, 2026

Uh oh!

shaun-agent commented May 6, 2026

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

shaun-agent commented May 6, 2026

Uh oh!

shaun-agent commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

apple8409 commented May 6, 2026

Summary

Motivation

Test plan

Uh oh!

github-actions Bot commented May 6, 2026

Uh oh!

shaun-agent commented May 6, 2026

OpenAB PR Screening

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

shaun-agent commented May 6, 2026

validation:

Uh oh!

shaun-agent commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants