feat(coding-agents): platform primitive (MVP through Fly Sprites + bug hunt) #4256
Open
Slice B finishes the platform-primitive migration: resume via nativeJsonl tee + cold-boot materialization, Horton tool migration to spawn_coding_agent / prompt_coding_agent, full removal of the legacy coder entity (source, tools, runtime types, UI, bootstrap), and a full UI revamp (CodingAgent* components, status enum extension, header Pin/Release/Stop buttons, lifecycle row rendering). Plus a runtime-level e2e test that closes the gap which hid Slice A's slug and flat-schema bugs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eager rebuild was scoped here to support state().workspace.sharedRefs accuracy after server restart, but the UI indicator consuming that field (sandbox provenance / 'shared with N' header) is also Slice C. Defer eager rebuild to land alongside its consumer; keep Slice A's lazy per-agent rebuild on first handler entry.
…ionId to sessionMeta
…tiveSessionId per turn Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds sanitiseCwd/materialiseResume helpers and calls materialiseResume in processPrompt (after sandbox.started, before wr.acquire) so that `claude --resume <sessionId>` finds its JSONL session file inside the sandbox on every cold-boot. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entity definition
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slice B's plan teed claude's stream-json stdout into nativeJsonl, then materialized those lines as the resume file. That's wrong: claude --resume reads its on-disk transcript, which has a different shape than the stream-json stdout. The bridge captures the wrong data; --resume rejects the malformed file.
Reworked:
- nativeJsonl is now a single-row transcript blob (key='current', nativeSessionId, content) holding the actual on-disk transcript.
- Handler reads the transcript via base64-piped docker exec after each successful turn (captureTranscript) and writes the full blob to nativeJsonl, overwriting the previous capture.
- Cold-boot materialize reads the single row's content and pipes it back via base64.
- Bridge no longer relies on agent-session-protocol@0.0.2 to extract session_id (it reads entry.sessionId in camelCase but claude emits session_id in snake_case, so the protocol returns ''). Bridge now parses the raw stdout JSON directly to extract session_id.
- onNativeLine is no longer used by the handler. The bridge still invokes it (Task 1.1's test still passes), but the handler doesn't subscribe.
- Updated handler-resume.test.ts seeds and assertion to match the new single-row schema.
Verified end-to-end with DOCKER=1 slice-b integration test (BANANA roundtrip): turn 1 establishes "favorite fruit is BANANA"; turn 2 on a fresh sandbox correctly recalls "BANANA" from the materialized transcript.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oding_agent/prompt_coding_agent
…der.ts) and unregister from bootstrap Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Agent implementation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and add new tool cases
…entTimeline, CodingAgentSpawnDialog; delete legacy CodingSession components Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… and Pin/Release/Stop buttons into router and sidebar
- Replace CodingSessionView/CODING_SESSION_ENTITY_TYPE with CodingAgentView/coding-agent in router.tsx
- Hoist useCodingAgent at router level; pass result as prop to CodingAgentView (single SSE connection)
- Pass db to EntityHeader for Pin/Release/Stop inbox-dispatch buttons
- Replace CodingSessionSpawnDialog with CodingAgentSpawnDialog in Sidebar; remove all legacy CODING_SESSION_ENTITY_TYPE refs
- Add @electric-ax/coding-agents workspace dep; fix Streamdown content->children prop
- UI typecheck clean
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the migration completion: resume + Horton + legacy removal + UI revamp. Captures the major mid-flight pivot (resume mechanism changed from per-line stream-json tee to single-row on-disk transcript capture) and the upstream agent-session-protocol bug (reads sessionId camelCase but claude emits session_id snake_case). All 32 coding-agents tests + 388 runtime + 44 agents pass green with DOCKER=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… spawn-dialog initialPrompt through initialMessage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge build path
Three real bugs surfaced when the agents-server-ui dist was rebuilt post-Slice-B and the manual smoke test exercised an entity in the browser:
1. router.tsx called useCodingAgent AFTER an early-return for missing selectedEntity → React error #310 ("rendered more hooks than during previous render"). Moved the hook call BEFORE the early return.
2. CodingAgentTimeline.tsx read tool_call payloads as `{toolName, args}`, but agent-session-protocol's ToolCallEvent uses `{tool, input}`. Tool calls rendered as generic "tool" with no name. Fixed to match the protocol's actual shape.
3. useCodingAgent.ts imported wire constants from @electric-ax/coding-agents. That package transitively pulls in node-only deps (LocalDockerProvider, StdioBridge, agent-session-protocol's randomUUID import) that vite can't externalize for the browser bundle. Hardcoded the four wire strings locally to break the import chain.
Plus two infra changes that enable iterating on the agents-server image locally:
- packages/agents-server/Dockerfile: copy `packages/coding-agents/package.json` and add a build step for the `coding-agents` workspace package (it didn't exist when this Dockerfile was written; agents now depends on it).
- packages/electric-ax/docker-compose.full.yml: change agents-server pull_policy from `always` to `missing` so a locally-tagged `electricax/agents-server:local` image is honoured by the quickstart instead of failing with a registry pull error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es + electric in docker) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quests The /send endpoint requires a 'from' identifier (used to tag inbox provenance). MessageInput sends 'user'; the new Pin/Release/Stop buttons forgot it and got HTTP 400 'Missing required field: from'. Added from='user' to all three. Verified with a curl POST that the endpoint now returns 204 No Content.
… on cold boot Previously, processPrompt unconditionally inserted sandbox.starting + sandbox.started lifecycle rows on every prompt. Since lm.ensureRunning is idempotent (returns the existing sandbox when one is running), warm prompts produced misleading 'Sandbox starting' entries between every turn in the UI timeline — the user's inference that the sandbox was restarting was understandable but wrong. Now both rows fire only when the prior status was 'cold'. Materialise of nativeJsonl is also gated on cold-boot (no need to re-write the file when the existing container already has it). Capture-transcript remains unconditional (we want a fresh snapshot after each turn). A small backfill path keeps sessionMeta.instanceId fresh on warm prompts when it was unset for some reason.
…ke_case) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… event types in CodingAgentTimeline Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… catch
Phase 1 Task 1 from docs/superpowers/plans/2026-05-03-coding-agents-post-review-followups.md was driven by R2 #5: "chain leak under concurrent acquirers — first releaser's chain-pointer check never matches, entries grow unbounded".
Empirical investigation: the bug doesn't manifest. Two new tests (N=8 concurrent acquire→release tasks; non-overlap + completion proof) confirm chainByIdentity drains to 0 after all releases and no acquirer is dropped.
The reason: the chain-delete branch is gated by `remaining === 0` (last releaser), and at that point the chain pointer is necessarily the last `link` set (no new acquirer present). The two conditions are coupled, not racy.
Keeping the tests as a forward-looking regression catch: any future refactor that introduces the failure mode the reviewer described will fail loudly. Plan task is closed as "not a bug, regression test added".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 2; drives Phase 0 Task 0.2 (L2.10 conformance) green. Before this change: an agent that ended a prior turn in status='error' went straight to status='running' on the next prompt. wasCold was false (only checks 'cold'), so no sandbox.starting lifecycle row was emitted; the stale lastError stayed visible through completion. State-machine paper claims error→cold→starting→running; reality was error→running. After: at top of processPrompt (after cancelIdleTimer), if status is 'error', flip to 'cold' and clear lastError so the cold-boot path runs normally. Idempotent on every status that isn't 'error'. Verified: L2.10 conformance passes 3/3 kinds on LocalDocker (35s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
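A minimal sketch of the two gates described so far: the error-to-cold reset at the top of prompt processing, and the cold-only lifecycle rows from the earlier warm-prompt fix. The `Status` union is collected from names that appear in this log, and `resetErrorToCold` / `bootLifecycleRows` are hypothetical helper names, not the handler's actual functions.

```typescript
// Status names are collected from this log; the exact union is an assumption.
type Status =
  | "cold" | "starting" | "running" | "idle"
  | "stopping" | "error" | "destroyed";

interface SessionMeta {
  status: Status;
  lastError?: string;
}

// Hypothetical helper: run first, so a failed agent re-enters through the
// ordinary cold-boot path and its stale error is cleared.
function resetErrorToCold(meta: SessionMeta): SessionMeta {
  if (meta.status !== "error") return meta; // no-op for every other status
  return { ...meta, status: "cold", lastError: undefined };
}

// Hypothetical helper: lifecycle rows for this prompt. Only a cold start
// produces them; warm prompts reuse the running sandbox.
function bootLifecycleRows(meta: SessionMeta): string[] {
  const wasCold = meta.status === "cold";
  return wasCold ? ["sandbox.starting", "sandbox.started"] : [];
}
```

Because the reset runs before the `wasCold` check, an agent that failed on its previous turn takes the same path as one that was evicted while idle.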
Phase 1 Task 3; drives Phase 0 Task 0.5 (L2.13 conformance) green. Before this change: fork copied source events + nativeJsonl unconditionally. If the source was running/starting/stopping, events were still streaming when the fork's first-wake observed them; convertNativeJsonl produced a transcript ending mid-message and the fork's first --resume would corrupt state. After: at the top of the fork branch (right after the events collection check), if sourceMeta.status is running/starting/stopping we emit kind.convert_failed with detail "fork rejected: source not quiescent (status=...)", set lastError + status='error' on the fork, and return. The user can re-fork once the source quiesces. Verified: L2.13 conformance passes 3/3 kinds on LocalDocker; L2.7, L2.8, L2.9, L2.10, L2.11, L2.12 also still pass (21/21 in scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
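The quiescence guard can be sketched as a pure check. This is an illustration under assumptions: `checkForkSource` and the `ForkCheck` result shape are hypothetical, and only the event name and the detail string come from the message above.

```typescript
// Assumed status union, collected from names used in this log.
type Status =
  | "cold" | "starting" | "running" | "idle"
  | "stopping" | "error" | "destroyed";

// Hypothetical result shape for the sketch.
type ForkCheck =
  | { ok: true }
  | { ok: false; event: "kind.convert_failed"; detail: string };

// True while events may still be streaming from the source agent.
function isInFlight(status: Status): boolean {
  return status === "running" || status === "starting" || status === "stopping";
}

// Reject a fork whose source could still be mid-turn, so the copied
// transcript never ends mid-message.
function checkForkSource(sourceStatus: Status): ForkCheck {
  if (isInFlight(sourceStatus)) {
    return {
      ok: false,
      event: "kind.convert_failed",
      detail: `fork rejected: source not quiescent (status=${sourceStatus})`,
    };
  }
  return { ok: true };
}
```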
Phase 1 Task 4 from R2 #8. L2.4 used to only assert runs[last].status === 'completed'. A provider that returned 'running' for a stale agent would reach reconcile's `isOrphaned` branch and transition status to `idle` — a different end-state than `cold` (the path for stopped/unknown). The original test passed for either outcome. Adding `expect(finalMeta.status).toMatch(/^(cold|idle)$/)` locks both branches into the contract — anything else (e.g. status stays `running` because reconcile mis-classified) fails loudly. Verified: still 3/3 on LocalDocker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mpotent close
Phase 2 Task 5 from R1 #4. execWithStdinViaPost's writeStdin and closeStdin had two defects:
- writeStdin appended to stdinBuf with no guard. After closeStdin fired the actual POST, any further write was silently buffered and lost (the data was never sent; the request had already flushed).
- closeStdin re-fired `void start()` on every call. start() itself was idempotent via its `started` flag, but the contract is cleaner if closeStdin is itself a no-op the second time.
Fix: a `closed` flag. writeStdin throws if called after close; closeStdin is a no-op the second time.
Defence-in-depth: StdioBridge calls writeStdin then closeStdin sequentially per turn, so the new throw can't fire under normal usage. The guard catches future caller misuse loudly instead of silently swallowing data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
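The `closed` flag pattern, sketched in isolation. `BufferedStdin` and its `send` callback are hypothetical stand-ins for the POST-based exec path; only the two behaviours (throw on write after close, no-op on second close) come from the message above.

```typescript
// Hypothetical stand-in for a stdin buffer that is flushed in one request.
class BufferedStdin {
  private readonly chunks: string[] = [];
  private closed = false;
  private readonly send: (body: string) => void;

  constructor(send: (body: string) => void) {
    this.send = send;
  }

  writeStdin(chunk: string): void {
    // After close the request has already flushed, so a late write could
    // never be delivered. Fail loudly instead of dropping it.
    if (this.closed) throw new Error("writeStdin called after closeStdin");
    this.chunks.push(chunk);
  }

  closeStdin(): void {
    if (this.closed) return; // second close is a no-op
    this.closed = true;
    this.send(this.chunks.join(""));
  }
}
```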
Phase 2 Task 6 from R1 #5. Sprites' exec API treats per-call env in a way that's both unstable (rc30→rc43 protocol shift) and inconsistent (silently ignored when the cmd is shell-wrapped). Today's WS / POST-stdin paths both forwarded `req.env` via repeated `?env=KEY=VALUE&` query params; for the conformance harness's claude-on-sprites tests the per-call ANTHROPIC_API_KEY mirror reached the child only because the env file already had it. Fix: stage per-call env inside the wrapper script. wrapWithAgentEnv now takes an optional `env` arg and emits an `export KEY=value` line per pair, after sourcing /run/agent.env (so per-call values override the file's defaults). Env-key validation rejects shell-unsafe names to defend against caller-controlled injection. The WS and POST exec URLs no longer carry env query params at all. Verified: L1.5 (exec honours cwd and env) passes on the live sprites API; sprites unit tests 18/18 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
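A rough sketch of staging per-call env inside the wrapper script. The function name, the key pattern, and the exact script layout are assumptions for illustration; the real `wrapWithAgentEnv` signature is not shown in this log. Values are single-quoted here so they reach the child verbatim.

```typescript
// Assumed rule for a shell-safe variable name.
const ENV_KEY = /^[A-Za-z_][A-Za-z0-9_]*$/;

// POSIX single-quote escaping: close the quote, emit \', reopen.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Hypothetical sketch: source the env file, then export per-call pairs so
// they override the file's defaults, then run the command.
function wrapWithAgentEnvSketch(cmd: string, env: Record<string, string> = {}): string {
  const lines = [". /run/agent.env"];
  for (const [key, value] of Object.entries(env)) {
    if (!ENV_KEY.test(key)) {
      throw new Error(`unsafe env key: ${JSON.stringify(key)}`);
    }
    lines.push(`export ${key}=${shellQuote(value)}`);
  }
  lines.push(cmd);
  return lines.join("\n");
}
```

Since the exports come after the `.` line, a per-call value wins over the same key in the file, and nothing about the environment has to travel in the exec URL.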
Phase 2 Task 7 from R1 #7. The Sprites API paginates with `has_more` + `next_continuation_token`, but the SpritesApiClient only fetched a single page. A sprite buried past the first page would be silently missed by `findExisting`, causing the next `createSprite` for the same agentId to 409 with "name already exists". Same risk in the cleanup-sprites operator script and the new conformance afterAll. Adds `listAllSprites(opts)` that loops until `!has_more` (or runs out of `next_continuation_token`), with a 50-page hard cap as a defensive guard. Updates `findExisting`, `cleanup-sprites.ts`, and the conformance afterAll to use it. Sprites unit tests 18/18 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
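The pagination loop can be sketched against an injected page fetcher. `has_more` and `next_continuation_token` are the field names given above; the `sprites` array field, the `Sprite` shape, and the `FetchPage` signature are assumptions for the sake of a self-contained example.

```typescript
interface Sprite { name: string }

// `sprites` is an assumed field name; the other two come from the message.
interface SpritePage {
  sprites: Sprite[];
  has_more: boolean;
  next_continuation_token?: string;
}

type FetchPage = (token?: string) => Promise<SpritePage>;

const MAX_PAGES = 50; // defensive hard cap

async function listAllSprites(fetchPage: FetchPage): Promise<Sprite[]> {
  const all: Sprite[] = [];
  let token: string | undefined;
  for (let page = 0; page < MAX_PAGES; page++) {
    const res = await fetchPage(token);
    all.push(...res.sprites);
    // Stop when the API says so, or when it gives no token to continue with.
    if (!res.has_more || !res.next_continuation_token) break;
    token = res.next_continuation_token;
  }
  return all;
}
```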
Phase 2 Task 8; drives Phase 0 Task 0.7 (L1.11 conformance) green on Host. Before: HostProvider.stop() was a no-op. The SandboxProvider contract (which L1.11 codifies) says stop() must terminate the running child within N seconds; calling stop() while a turn was mid-exec on the host left the child running unsupervised. LocalDocker met this by implication (container removal kills the process); sprites by the WS close path; host did not. After: AgentRecord tracks `activeChildren: Set<ChildProcess>`. Each spawn registers; child `exit`/`error` unregisters. stop() and destroy() now SIGTERM every active child for the matching agent and fall back to SIGKILL after a 5 s grace period via a shared terminateChildren helper. Verified: L1.10 + L1.11 both pass on Host (LocalDocker and sprites conformance unaffected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The file's own comment said 'delete in a follow-up once the conformance suite has shipped for one release cycle' — that's now true. Slice-A lifecycle scenarios are exercised by local-docker-conformance.test.ts via runCodingAgentsIntegrationConformance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…euristic
Phase 3 Task 11.
- sessionId now must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/ (rejects
leading dashes like '-rf'). Defence-in-depth — adapters shellQuote
but the CLI boundary should reject obvious shell metacharacters
even before the entity boundary.
- isMain replaces `endsWith('import.js')` with
`path.basename(...) === 'import.js'` so a consumer file with that
suffix doesn't accidentally activate this CLI body when imported.
Test added for the leading-dash rejection. cli-import unit tests 10/10.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
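Both checks are small enough to sketch directly. The regex is the one quoted above; `assertSessionId` and `isImportEntry` are hypothetical names, and the basename is computed by hand here only to keep the example free of imports (the commit uses `path.basename`).

```typescript
const SESSION_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/;

// Throws on anything that could be read as a flag or shell metacharacter.
function assertSessionId(id: string): string {
  if (!SESSION_ID.test(id)) {
    throw new Error(`invalid sessionId: ${JSON.stringify(id)}`);
  }
  return id;
}

// Exact basename match, so 'my-import.js' no longer counts as the CLI entry.
function isImportEntry(scriptPath: string): boolean {
  const base = scriptPath.split(/[\\/]/).pop() ?? "";
  return base === "import.js";
}
```

A suffix check would accept `/app/my-import.js`; the basename comparison does not.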
Phase 3 Task 10.
- model now matches /^[A-Za-z0-9._/:-]+$/. Without this, a model like `gpt-4";evil="x` would close the value quote in the `-c model="..."` flag and inject an arbitrary config key. Adapters are reachable from import CLI / spawn args paths and shouldn't trust caller-supplied input.
- sessionId now matches /^[A-Za-z0-9-]+$/ (UUID shape) in buildCliInvocation, probeCommand, and captureCommand. Without this, a sessionId containing `*` or `?` would broaden the find glob silently and could match an unrelated transcript.
Both validations throw early with a clear error rather than silently producing wrong argv.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
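To make the two failure modes concrete, here is a small sketch using the patterns quoted above. `modelFlag` and `transcriptGlob` are hypothetical helpers, and the glob layout in particular is invented for illustration; the adapters' real argv is not shown in this log.

```typescript
const MODEL = /^[A-Za-z0-9._\/:-]+$/;
const SESSION_ID_UUID_SHAPE = /^[A-Za-z0-9-]+$/;

// Hypothetical: builds the quoted config flag only from a validated model.
function modelFlag(model: string): string[] {
  if (!MODEL.test(model)) {
    throw new Error(`invalid model: ${JSON.stringify(model)}`);
  }
  return ["-c", `model="${model}"`];
}

// Hypothetical: a find-style glob that embeds the session id. The layout
// is an assumption; the point is that '*' and '?' never reach it.
function transcriptGlob(sessionId: string): string {
  if (!SESSION_ID_UUID_SHAPE.test(sessionId)) {
    throw new Error(`invalid sessionId: ${JSON.stringify(sessionId)}`);
  }
  return `*${sessionId}*.jsonl`;
}
```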
Phase 3 Task 9.
The handler races bridge.runTurn against runTimeoutMs via raceTimeout.
Before this change, the loser's child process kept running — raceTimeout
rejects the promise, but the orphan CLI stays attached to its sandbox
until the sandbox itself tears down.
Adds optional `signal?: AbortSignal` to RunTurnArgs. The bridge
forwards an abort to handle.kill('SIGTERM') and rejects with a clear
'aborted (signal)' error if the abort fires before normal exit. The
listener is removed on completion to avoid leaking the closure.
The handler creates an AbortController with a setTimeout matching
runTimeoutMs and passes its signal alongside the existing raceTimeout.
The two coexist: raceTimeout owns the promise-level rejection; the
signal owns the child-reaping. clearTimeout in finally avoids the
timer firing post-completion.
All unit tests pass (151/159; the 2 failing are pre-existing
handler-resume flakes unrelated to this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 Task 19: replaces a fixed setTimeout(2500) with a poll loop in slice-b.test.ts. R5 flagged the timing as flaky on slow CI — the idle timer fires at 1500 ms but the visible 'stopped'/'unknown' status transition includes container teardown which can push past the fixed wait. New helper polls every 100 ms with a 30 s ceiling. Bonus: typecheck-fix in conformance/integration.ts where the L2.13 scenario referenced a non-existent 'fork.failed' lifecycle event. The schema's enum uses 'kind.convert_failed' for both kind-converts and forks (see entity/collections.ts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
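A generic version of such a poll helper might look like the following. The name `pollUntil` and its parameter list are assumptions; the 100 ms interval and 30 s ceiling are the values given above.

```typescript
// Hypothetical helper: re-check a condition on an interval instead of
// sleeping for a fixed time and hoping the transition has happened.
async function pollUntil(
  check: () => boolean | Promise<boolean>,
  intervalMs = 100,
  timeoutMs = 30_000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

On a fast machine the test finishes as soon as the status flips; on slow CI it simply waits longer, up to the ceiling.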
Phase 4 Tasks 12 + 13.
T12: CodingAgentSpawnDialog.canSubmit now enforces the target ⇄ workspaceMode invariants explicitly:
- target='host' requires workspaceMode='bindMount'
- target='sprites' requires workspaceMode='volume'
The button-click handlers force the right workspace when toggling target, but a strict-mode double-render or future refactor could submit a stale combo. Belt-and-braces.
T13: EntityHeader Convert-target and Convert-kind dropdowns now disable when status is 'error', not just running/starting/stopping or destroyed. Converting from a failed prior turn risks acting on stale lastError state; the user should retry to clear the error first. 'cold' stays allowed (it's a state-mutation only; no sandbox is up).
Both files typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: for any input string s and any partition s = p1+p2+...+pN, feeding the partitions sequentially into StreamQueue and calling end() produces a line sequence identical to feeding s whole. Catches future regressions of the C2 line-tail-buffer fix in providers/fly-sprites/exec-adapter.ts. 7 canonical inputs × 200 deterministic seeds = 1400 partitions checked. Sub-second runtime. StreamQueue is now exported from exec-adapter.ts so the test can import it directly without going through the full ExecHandle factory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
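The partition property can be illustrated with a stand-in line buffer. `LineBuffer` below is a simplified assumption, not the real StreamQueue, and the seeded generator is a plain LCG chosen only to keep the partitions deterministic.

```typescript
// Simplified stand-in for a line-tail buffer: keeps the unfinished last
// line between pushes and flushes it on end().
class LineBuffer {
  private tail = "";
  private readonly lines: string[] = [];

  push(chunk: string): void {
    const parts = (this.tail + chunk).split("\n");
    this.tail = parts.pop() ?? "";
    this.lines.push(...parts);
  }

  end(): string[] {
    if (this.tail !== "") {
      this.lines.push(this.tail);
      this.tail = "";
    }
    return this.lines;
  }
}

// Deterministic pseudo-random source so every seed replays identically.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Cut s into consecutive chunks of 1 to 4 UTF-16 code units.
function randomPartition(s: string, rand: () => number): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < s.length) {
    const len = 1 + Math.floor(rand() * 4);
    out.push(s.slice(i, i + len));
    i += len;
  }
  return out;
}

function linesOf(chunks: string[]): string[] {
  const buf = new LineBuffer();
  for (const c of chunks) buf.push(c);
  return buf.end();
}
```

The property then reads: for every seed, `linesOf(randomPartition(s, lcg(seed)))` equals `linesOf([s])`.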
Property: at the 900_000-byte cap, accept LIMIT-1 and LIMIT, reject LIMIT+1, and reject any multibyte string whose UTF-8 byte length overflows even when its UTF-16 code-unit length doesn't. Catches future regressions of the C5 fix in StdioBridge that replaced prompt.length with Buffer.byteLength(prompt, 'utf8'). One test specifically pins the multibyte boundary case where the old code would have silently accepted a too-long prompt; another asserts the error message reports byte count (not chars) so users can correlate with the limit. 5 cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
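The gap between code units and bytes is easy to show. This sketch uses `TextEncoder` in place of `Buffer.byteLength` so it stays free of Node-specific imports; `assertPromptFits` is a hypothetical name and the error wording is an assumption.

```typescript
const PROMPT_LIMIT_BYTES = 900_000;

// UTF-8 size of the string, which is what actually travels over stdin.
function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

// Hypothetical guard: compare bytes, not s.length (UTF-16 code units).
function assertPromptFits(prompt: string): void {
  const bytes = utf8ByteLength(prompt);
  if (bytes > PROMPT_LIMIT_BYTES) {
    throw new Error(`prompt is ${bytes} bytes; limit is ${PROMPT_LIMIT_BYTES} bytes`);
  }
}
```

A string of 450,001 copies of `é` has a `.length` well under the cap, yet encodes to 900,002 bytes, which is the case a `prompt.length` check would let through.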
Property: every adapter's buildCliInvocation produces byte-stable
argv across (kind × {prompt-only, with-model, with-session,
with-both}) input shapes.
Would have caught the opencode --print-logs accident at compile-time
of the test suite, not at L2.1 runtime. Catches future drift in
claude/codex/opencode argv shape — reviewer must explicitly approve
any change via 'pnpm test -u'.
10 snapshots locked. opencode prompt-only / session-only shapes are
skipped (the adapter requires a model).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: every adapter's probeCommand / captureCommand / postMaterialiseCommand must treat sessionId as data, not code. For each (adapter × adversarial input) pair, the resulting argv must either (a) not pass through sh -c (direct argv invocation is safe by construction), or (b) embed the adversarial input fully inside one sh-tokenized word's content, or (c) reject the input by throwing during the build (validation-throw is equivalent to safe). Adversarial corpus: 9 strings covering single-quote close-and-reopen, command substitution sigils, redirect operators, glob metacharacters, spaces, backslash, and the textbook '\'' escape attempt. Generalises the C6 fix (opencode shellQuote) — any future adapter that forgets to shell-quote a caller-controlled field will fail this test loudly. Tokenizer handles single-quote, double-quote (with the correct sh-style backslash escape rules: only $, \`, \", \\, \n are escapes inside double quotes), and outside-of-quotes backslash-escape. 7 cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: for every (initial status × control-plane message) pair, the resulting (final status, lastError-presence, lifecycle-events) is byte-stable against a checked-in snapshot. Catches state-machine drift over time. Any future handler change that alters a cell forces an explicit `pnpm test -u` and reviewer approval — the diff makes the new transition visible. Coverage: 7 statuses × 6 control-plane messages = 42 cells. The `prompt` message is excluded; it requires non-trivial bridge mocking and is covered end-to-end by the L2 conformance suite. Stub provider returns status='unknown' so reconcile orphan branches fire when initial status is 'running'. Stub bridge throws if called. Each cell isolates with a fresh handler+lm+wr triple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 Task 14. PUT /coding-agent/<name> returns once the entity row is durable, but the handler registration that consumes our 'pin' payload lands on a separate path. Several specs raced spawn → pin → goto → assert and saw an empty timeline because the pin landed before init wiring was complete. The race wasn't reliably reproducible in this session but matches the R4 review finding. waitForEntityReady polls GET /coding-agent/<name> for up to 5 s before sending the pin. When the entity is registered and discoverable, fire the pin as before. Verified: full Playwright suite still passes 31/31 (1 skipped) on the running dev server. No spec changes needed — helper is internal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The race the reviewer described wasn't reproducible in this session; the Playwright suite was 31/31 green both with and without the poll. The 100 ms × N poll overhead doesn't earn its keep. If a flake actually surfaces, re-add a targeted poll. Speculative coverage isn't worth the runtime cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
L2.9 codified a chain-leak property in WorkspaceRegistry that I empirically verified does NOT manifest (R2 #5 was a false positive, documented in the regression-catch unit tests). The 3-way concurrent test passed today by accident of correctness, not because it caught a bug. L2.6 (2-way non-overlap with the lease serialising concurrent runs) covers the load-bearing property. The unit-level workspace-registry tests added in commit 2021c5d cover the chain-pointer cleanup invariant directly without spinning up real sandboxes. Net: −68 lines, no coverage regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-timer redundancy (raceTimeout + AbortController firing in
parallel) was a code smell. The orphan-child cleanup is the sandbox
layer's responsibility:
- LocalDocker: container removal kills the child on next destroy().
- HostProvider: T8 added activeChildren tracking; stop()/destroy()
SIGTERM with SIGKILL fallback.
- FlySpriteProvider: WS close terminates the exec.
Without T9, the bridge has no concept of timeouts — that's correct
layering. raceTimeout in the handler rejects the runTurn promise; the
sandbox's next teardown reaps the child. The 'instant SIGTERM on
timeout' behaviour T9 added is nice-to-have but not load-bearing
given the sandbox-level guarantees.
Net: −41 lines (types + bridge + handler). 27/29 unit tests pass
(the 2 failures are pre-existing handler-resume flakes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DRY pass on the post-review additions: shellQuote was duplicated in claude.ts, codex.ts, opencode.ts (3 copies of the same 3-line function). Extracted to agents/shell-quote.ts so a future fix to the quoting algorithm lands in one place. isInFlight was duplicated in processStop, processConvertTarget, processConvertKind, and the fork-source quiescence guard (4 copies of the same status === running || starting || stopping check). Extracted to a top-level helper in handler.ts. Also makes the intent more self-documenting at call sites. Net: −20 lines, single source of truth for two cross-cutting concepts. Verified: all unit tests pass (163/171, with 2 pre-existing handler-resume flakes unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin wrapper skill for the electric-ax-import CLI: detects the current workspace + session id from the active Claude Code session, then runs the import against a running electric-agents server. The session shows up as a coding-agent entity in the UI (observable / forkable). Lives at packages/coding-agents/claude-skills/electric-import/SKILL.md; install with cp -R into ~/.claude/skills/. README.md added to both the package root and claude-skills/ documents the install + the trigger phrases. A note in the package README clarifies the supported scope: importing makes the session observable, NOT injectable. Claude Code has no third-party API for pushing user-messages into a running interactive session — see the research summary in the May 2026 session notes. Codex / opencode equivalents documented as out-of-scope-for-now in the skill's own 'Out of scope' section. Verified locally: `cp -R` into ~/.claude/skills/ and the skill registers in the available-skills list at session start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: imported agent on target='host' showed 'Sandbox starting'
in the timeline, even though no sandbox is actually booting — host
provider does a quick stat() on the workspace and is otherwise an
attach. The misleading lifecycle event fires every time the agent
warms back up from cold (idle eviction → next prompt).
Suppressing the sandbox.starting / sandbox.started lifecycle inserts
when target='host'. Status transitions through 'starting' are
preserved (state-machine consistency). sandbox.failed stays — host
attach can still fail (workspace not a directory) and the failure is
meaningful.
L2.2 conformance still passes ('warm second prompt' asserts
not.toContain('sandbox.starting'); the new behaviour is a strict
superset).
Persisted lifecycle rows on existing agents aren't retroactively
cleaned. Fix only affects future cold-boots after the dev server
picks up the rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ks render line-per-line
Two bugs surfaced by inspecting an imported claude→codex fork.
1. Tool calls dropped (coding-agents):
codex exec --json emits shell invocations as
{type:'item.completed', item:{type:'command_execution', command,
aggregated_output, exit_code}}
The patched agent-session-protocol@0.0.2 only handles function_call /
function_call_output items — command_execution is silently dropped.
Result: every shell call codex made was invisible in the timeline.
Fix: a small pre-pass (codex-command-shim.ts) expands each
command_execution item into a function_call + function_call_output
pair on the wire so asp's existing matchers fire. Order preserved
(call before output, both share the item id so asp pairs them).
Cheap and self-contained — no upstream patch maintenance.
2. Assistant code-block lines rendered as one mashed string (UI):
Streamdown wraps each source line of a fenced code block in a
<span class='block ...'>. styles.css already has a rule that
forces those spans to display:block, but the rule is scoped to
.agent-ui-markdown and AssistantMessageRow forgot the
className. Result: spans stayed display:inline and 69 lines of a
tree listing rendered as one mashed string.
Fix: add className='agent-ui-markdown' to AssistantMessageRow's
wrapper. Mirrors AgentResponse.tsx (Horton's renderer, which
already had it).
Verified: typecheck clean, 163/171 unit tests pass (2 pre-existing
handler-resume flakes). Send a fresh codex prompt to see both fixes
land — existing events rows aren't retroactively rewritten.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
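The shape of the pre-pass in fix 1 can be sketched as follows. Only the `command_execution` input fields (`command`, `aggregated_output`, `exit_code`) come from the message above; the `WireEvent` types and the fields on the emitted `function_call` / `function_call_output` items (`name`, `arguments`, `output`) are assumptions, since the protocol's expected item shape is not shown here.

```typescript
// Loose wire types for the sketch; the real event types are not shown.
interface WireItem { id: string; type: string; [field: string]: unknown }
interface WireEvent { type: string; item?: WireItem }

// Expand one command_execution item into a call + output pair that share
// the item id. Every other event passes through untouched.
function expandCommandExecution(event: WireEvent): WireEvent[] {
  const item = event.item;
  if (event.type !== "item.completed" || !item || item.type !== "command_execution") {
    return [event];
  }
  const call: WireEvent = {
    type: "item.completed",
    item: {
      id: item.id,
      type: "function_call",
      name: "shell", // assumed tool name
      arguments: JSON.stringify({ command: item["command"] }),
    },
  };
  const output: WireEvent = {
    type: "item.completed",
    item: {
      id: item.id,
      type: "function_call_output",
      output: item["aggregated_output"],
      exit_code: item["exit_code"],
    },
  };
  return [call, output]; // call first, then its output
}
```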
Claude prefers ANTHROPIC_API_KEY over CLAUDE_CODE_OAUTH_TOKEN when both
are present, treating the value verbatim as a plain API key. When the
host's ANTHROPIC_API_KEY actually contains an OAuth subscription token
(`sk-ant-oat...` — common when the dev shell inherited it from
claude.ai's keychain bridge), the previous register.ts mirrored it into
CLAUDE_CODE_OAUTH_TOKEN AND left ANTHROPIC_API_KEY in the forwarded
env. Inside the sandbox (which has no keychain fallback), claude picks
ANTHROPIC_API_KEY, hits the API as a plain key, and every turn fails
with "Invalid API key" -> exit 1, stderr empty (the JSON error lands on
stdout).
Symptom in stream: `cli-exit:claude CLI exited 1. stderr=<empty>` for
both bindMount and volume workspace types — workspaceType was a red
herring; the failure is auth-shape-only.
Fix: when ANTHROPIC_API_KEY starts with `sk-ant-oat`, promote to
CLAUDE_CODE_OAUTH_TOKEN and delete ANTHROPIC_API_KEY before forwarding.
Also extract the supplier as `defaultEnvSupplier` so it can be tested
directly with an injected env source.
Verified: spawning a fresh claude/sandbox/bindMount agent and a fresh
claude/sandbox/volume agent both complete the turn successfully
("OK" assistant text), where on `main` they both fail identically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
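The promotion rule, sketched as a pure function over an injected env source. `claudeAuthEnv` is a hypothetical name rather than the real `defaultEnvSupplier`, and letting an explicitly set CLAUDE_CODE_OAUTH_TOKEN win over the promoted value is an assumption this log does not confirm.

```typescript
type EnvSource = Record<string, string | undefined>;

const OAUTH_PREFIX = "sk-ant-oat";

// Decide which auth variables to forward into the sandbox.
function claudeAuthEnv(source: EnvSource): Record<string, string> {
  const out: Record<string, string> = {};
  const apiKey = source["ANTHROPIC_API_KEY"];
  const oauth = source["CLAUDE_CODE_OAUTH_TOKEN"];

  if (apiKey !== undefined && apiKey.startsWith(OAUTH_PREFIX)) {
    // OAuth-shaped value: forward it only under the OAuth variable and
    // drop ANTHROPIC_API_KEY, which claude would otherwise prefer.
    // Assumption: an explicitly set OAuth token takes precedence.
    out["CLAUDE_CODE_OAUTH_TOKEN"] = oauth ?? apiKey;
    return out;
  }
  if (apiKey) out["ANTHROPIC_API_KEY"] = apiKey;
  if (oauth) out["CLAUDE_CODE_OAUTH_TOKEN"] = oauth;
  return out;
}
```

Taking the env as a parameter is what makes the rule testable without touching `process.env`.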
…arget=host
Adds regression coverage for commit b26d41e. The fix in handler.ts' processPrompt cold-boot block (skipping sandbox.starting/sandbox.started lifecycle inserts when meta.target === 'host') was correct, but had no unit test pinning it.
Reproduced live: a freshly spawned host-target coding-agent on the running dev server still showed 'Sandbox starting' in its timeline because the running start-builtin process predated the dist rebuild and was holding the pre-fix module in memory. After restarting the handler, host agents emit zero sandbox.* lifecycle rows end-to-end (verified: PUT spawn → POST prompt → curl /main → empty lifecycle collection).
Two unit tests added in entity-handler.test.ts:
1. cold → starting → idle on host: status transitions through 'starting' for state-machine consistency, but neither sandbox.starting nor sandbox.started ends up in the lifecycle collection.
2. error → cold → starting → idle on host (re-prompt after a prior CLI exit): the error fall-through resets to 'cold' and re-runs the cold-boot block, so the host suppression must hold there too.
Both tests fail with the host gates removed (verified locally) and pass with them in place. sandbox.failed is intentionally untouched: host attach can fail (e.g. workspace not a directory) and that's meaningful.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or sprites
Surfaced by R2's investigation of agent FYuEorn_F7. Codex inside our Docker sandbox container can't run any shell command on macOS Docker Desktop: codex's inner bwrap-based command sandbox fails with 'bwrap: No permissions to create a new namespace, likely because the kernel does not allow non-privileged user namespaces.' Result: every shell tool call dies, codex silently produces no useful output, and the user sees an agent that does nothing.
Codex 0.128 ships --dangerously-bypass-approvals-and-sandbox documented as 'intended solely for running in environments that are externally sandboxed.' That's exactly target=sandbox (Docker container) and target=sprites (sprite is the workspace and the isolation boundary). For target=host we leave codex's normal sandbox active: there is no outer isolation, so codex's bwrap layer is the only one.
Threaded `target` through:
RunTurnArgs -> stdio-bridge -> CodingAgentAdapter.buildCliInvocation
claude / opencode adapters ignore the new field. codex uses it. The existing argv-stability snapshot doesn't change (target wasn't part of the input shapes covered).
Tests: 4 new direct assertions in adapter-argv.test.ts covering sandbox / sprites / host / undefined target. 16/16 pass on the contract suite. Full unit run: 171 passed (the 2 failing handler-resume tests are pre-existing and unrelated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
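The target-dependent flag reduces to a small decision function. This is a sketch: `codexSandboxFlags` is a hypothetical name, and treating an undefined target like host (no bypass) is an assumption chosen as the conservative default, not something this log states.

```typescript
type Target = "sandbox" | "host" | "sprites";

// Extra codex argv for the given target. The bypass flag is only added
// where an outer isolation boundary already exists.
function codexSandboxFlags(target?: Target): string[] {
  const externallySandboxed = target === "sandbox" || target === "sprites";
  return externallySandboxed ? ["--dangerously-bypass-approvals-and-sandbox"] : [];
}
```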
Summary
Introduces `@electric-ax/coding-agents`, a platform primitive for spawning, supervising, and resuming coding-agent sandboxes. Migrates Horton off the legacy `coding-session` entity onto the new typed runtime API and ships a dedicated UI view. Currently supports claude / codex / opencode as agent kinds and sandbox / host / sprites as sandbox targets.

MVP and Slice A–C₂ (foundation)
- `LocalDockerProvider` + `StdioBridge` running `claude --print --output-format=stream-json` inside a containerised workspace.
- … (`ctx.spawnCodingAgent` / `ctx.observeCodingAgent`), entity handler with reconcile, `LifecycleManager` (idle timer, pin refcount), `WorkspaceRegistry` (per-identity mutex, slug resolution).
- `coding-session` removal, dedicated UI components.
- `SandboxInstance.copyTo` (avoid argv `ARG_MAX`), probe-and-materialise resume, `--env-file` for secrets, idle-eviction wake.
- … (`agent-session-protocol`'s `CodingAgentKind` widened to claude+codex; UI kind picker; per-kind probe/capture/materialise hooks).

Conformance + cross-kind
- … (`packages/coding-agents/src/conformance/integration.ts`): 16 scenarios × N kinds × M targets, pluggable across providers. Catches drift between LocalDocker / Host / Sprites.
- `convert-kind` swaps the CLI in place; `fork` (with `from`) spawns a sibling agent inheriting the parent's denormalised event history. Same-kind and cross-kind both supported.

Slice: opencode (third agent kind)
- … (`openai/gpt-5.4-mini-fast`); per-provider env-var fallback.

Slice: Fly Sprites provider
sprites.dev (Fly's purpose-built agentic-sandbox product) ships as a third sandbox target alongside `sandbox` (LocalDocker) and `host`. `convert-kind` works in place; `fork` within sprites carries conversation history.

- … (`api.sprites.dev/v1/sprites/{name}/exec`, not the per-sprite URL)
- … (`0x01` stdout, `0x02` stderr, `0x03 <code>` exit)
- … (`opencode-ai` needs install, with `--prefix=/usr/local`)
- … (`set -a; . file; set +a`)
- … (`ANTHROPIC_API_KEY` shaped as `sk-ant-oat...` mirrored to `CLAUDE_CODE_OAUTH_TOKEN`)
- `bootstrap.ts` actually wires the sprites provider when `SPRITES_TOKEN` is set
- `docs/superpowers/plans/2026-05-02-coding-agents-fly-sprites.md` § Implementation findings — round 2.

UI bug-hunt (Playwright MCP, 10 iterations)
Drove the UI end-to-end through Playwright MCP against the live dev stack. Surfaced and shipped four fixes:
- `cleanup:sprites` script's PREFIXES missed the production `coding-agent-` prefix; production-spawned leaks were invisible.
- … `destroyed` entities; gated in `EntityHeader.tsx` + Playwright spec.
- `pnpm cleanup:volumes` operator script (mirrors `cleanup:sprites`). Lists/deletes orphaned `coding-agent-workspace-*` volumes; default skips still-mounted ones.
- `dev.mjs` lost stream registry across `up`-after-`down` because the embedded `DurableStreamTestServer` kept state in memory. Now sets `ELECTRIC_AGENTS_STREAMS_DATA_DIR=.local/dev-streams`; `clear-state` wipes it alongside compose volumes.

Bug-hunt report: `docs/superpowers/specs/2026-05-03-bug-hunt-report.md`.

Documentation
- `packages/coding-agents/README.md` restructured around current state: TOC, quick-reference table, lifecycle/inbox-message/status reference, targets capability matrix, cleanup utilities, consolidated TLs.
- `docs/superpowers/specs/2026-05-02-coding-agents-fly-sprites-design.md` carries a header banner pointing at round-2 findings and inlines the corrections in §1.
- `docs/superpowers/plans/2026-05-02-coding-agents-fly-sprites.md` includes both round-1 and round-2 implementation findings.

Test plan
- `pnpm -C packages/coding-agents test` (unit): 139 green; new sprites tests cover stream-id demux, exit-frame fallback, name format `[a-z0-9-]+`, bootstrap content, wiring smoke.
- `DOCKER=1` LocalDocker conformance: full 16 scenarios across claude + codex + opencode.
- `HOST_PROVIDER=1` host-provider conformance.
- `SPRITES=1 SPRITES_TOKEN=…` `sprites-wiring.e2e.test.ts` green (2.5 s): guards bootstrap-wiring + sprite name format regressions.
- `pnpm -C packages/agents-server-ui exec playwright test`: `spawn-via-dialog.spec.ts` (6 cases) + `spawn-sprites.spec.ts` (2 cases).
- `fly-sprites-conformance.test.ts` re-run under round-2 fixes (vitest verbose-reporter buffering issue; needs streaming-reporter or per-scenario splits).

🤖 Generated with Claude Code
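
For readers reviewing the sprites unit tests named in the test plan (stream-id demux and the `[a-z0-9-]+` name format), here is a minimal sketch of what those checks exercise. Only the stream ids (`0x01` stdout, `0x02` stderr, `0x03 <code>` exit) and the name pattern come from this PR. The sketch assumes each binary message is a single frame whose first byte is the stream id, and that the exit code is the single byte after `0x03`. `demuxFrame`, `Frame`, and `isValidSpriteName` are hypothetical names, and the exit-frame fallback behaviour is not modelled.

```typescript
// Hypothetical frame model; the real provider's types are not shown in this PR.
type Frame =
  | { kind: "stdout"; data: Uint8Array }
  | { kind: "stderr"; data: Uint8Array }
  | { kind: "exit"; code: number };

// Assumption: one message equals one frame, and byte 0 is the stream id.
function demuxFrame(msg: Uint8Array): Frame {
  const streamId = msg[0];
  if (streamId === undefined) throw new Error("empty frame");
  const payload = msg.subarray(1); // view over the remaining bytes, no copy

  switch (streamId) {
    case 0x01:
      return { kind: "stdout", data: payload };
    case 0x02:
      return { kind: "stderr", data: payload };
    case 0x03: {
      // Assumption: the exit code is the single byte after the stream id.
      const code = payload[0];
      if (code === undefined) throw new Error("exit frame without a code");
      return { kind: "exit", code };
    }
    default:
      throw new Error(`unknown stream id ${streamId}`);
  }
}

// Name format from the test plan: lowercase letters, digits, and hyphens only.
const SPRITE_NAME = /^[a-z0-9-]+$/;

function isValidSpriteName(name: string): boolean {
  return SPRITE_NAME.test(name);
}
```

Throwing on an unknown stream id, rather than dropping the frame, is a design choice of this sketch: it makes protocol drift fail loudly in tests instead of silently losing output.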