feat(acp): surface agent stderr and add startup health check#752
feat(acp): surface agent stderr and add startup health check#752apple8409 wants to merge 2 commits intoopenabdev:mainfrom
Conversation
- Pipe agent stderr instead of discarding it; each line is logged at WARN level (openab::agent target) and buffered in a 8-line ring buffer - On connection close, append the buffered stderr tail to the error message so "connection closed; agent stderr: ..." is self-contained - Add SessionPool::health_check() that runs initialize + session/new at startup and logs a clear ERROR if the agent cannot be reached, instead of failing silently on the first user message Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
All PRs must reference a prior Discord discussion to ensure community alignment before implementation. Please edit the PR description to include a link like: This PR will be automatically closed in 3 days if the link is not added. |
OpenAB PR ScreeningThis is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening report## IntentPR #752 is trying to make ACP agent startup and connection failures diagnosable from the main OpenAB logs. The operator-visible problem is that agent process failures currently collapse into generic connection errors like FeatThis is a feature / observability improvement for ACP agent runtime operations. It changes behavior by:
Who It ServesPrimary beneficiaries: agent runtime operators and deployers. Secondary beneficiaries: maintainers and reviewers debugging failed OpenAB deployments, especially Discord bot deployments where the bot may connect successfully while the backing agent is broken. Rewritten PromptImplement ACP agent diagnostics for startup and unexpected process exit failures. Update the ACP connection layer so spawned agent stderr is captured, logged line-by-line at Add a Keep the implementation scoped to diagnostics. Add or document verification for successful startup, broken agent executable/path, stderr logging, and normal end-to-end agent use. Merge PitchThis is worth advancing because it turns a high-friction operational failure mode into a self-contained log message. The change improves deployability without changing the user-facing ACP workflow. Risk profile is moderate-low: the main behavior change is around process stderr handling and startup probing. The likely reviewer concern is whether stderr capture can deadlock, leak output across sessions, over-log noisy tools, or cause startup failure semantics that are too aggressive for production. Best-Practice ComparisonOpenClaw principles that apply:
Hermes Agent principles that apply:
Overall, this PR aligns with the observability and isolation parts of the reference systems, but it is not trying to solve scheduling, persistence, or delivery routing. Implementation OptionsOption 1: Conservative diagnostics only Option 2: Balanced runtime diagnostics plus startup health check Option 3: Ambitious agent readiness subsystem Comparison Table
RecommendationAdvance the balanced option, but review it tightly around process I/O safety and startup semantics. The PR addresses a real deployment pain point with a reasonably small surface area. The right merge discussion should focus on whether stderr reading is non-blocking enough, whether the 8-line buffer is sufficient, whether noisy stderr should always be If reviewers want to reduce risk, split follow-up work into a second PR for richer readiness behavior, metrics, or retry/backoff. This PR should stay focused on making agent failures visible and actionable. |
|
took this over and pushed (). fixed:
found:
validation:
|
|
clean follow-up comment because the previous one was mangled by shell interpolation. pushed commit: d0a3986 fixed:
found:
validation:
|
Summary
/dev/null). Each line is logged atWARNlevel (openab::agenttarget) and buffered in an 8-line ring buffer per connection.connection closed; agent stderr: env: node: No such file or directory— making the root cause self-contained in the log without needing to cross-reference a separate error log file.SessionPool::health_check): on boot, OpenAB spawns a temporary agent connection and runsinitialize+session/new. A clearERRORis logged if the agent cannot be reached, instead of failing silently on the first user message.Motivation
Diagnosing agent spawn failures was difficult because:
/dev/null, swallowing all agent error outputJSON-RPC error -1: connection closedappearing only after a user message triggered a dispatchA concrete example: on macOS,
launchdservices run with a strippedPATH(/usr/bin:/bin:/usr/sbin:/sbin). If the agent binary uses#!/usr/bin/env nodeand Node.js is installed via Homebrew (/opt/homebrew/bin), the process exits immediately withenv: node: No such file or directory— previously invisible, now surfaced in the log.Test plan
agent health check passedappears in log on normal startupnodefrom PATH, confirm log showsagent health check FAILED: connection closed; agent stderr: env: node: No such file or directoryWARN [openab::agent] [stderr] ...in bot log🤖 Generated with Claude Code