feat(coding-agents): platform primitive (MVP through Fly Sprites + bug hunt) by balegas · Pull Request #4256 · electric-sql/electric

balegas · 2026-05-01T01:50:12Z

Summary

Introduces @electric-ax/coding-agents — a platform primitive for spawning, supervising, and resuming coding-agent sandboxes. Migrates Horton off the legacy coding-session entity onto the new typed runtime API and ships a dedicated UI view. Currently supports claude / codex / opencode as agent kinds and sandbox / host / sprites as sandbox targets.

MVP and Slice A–C₂ (foundation)

MVP: LocalDockerProvider + StdioBridge running claude --print --output-format=stream-json inside a containerised workspace.
Slice A: Runtime API (ctx.spawnCodingAgent / ctx.observeCodingAgent), entity handler with reconcile, LifecycleManager (idle timer, pin refcount), WorkspaceRegistry (per-identity mutex, slug resolution).
Slice B: Lossless resume via post-turn transcript materialisation, Horton tool migration, legacy coding-session removal, dedicated UI components.
Slice C₁: SandboxInstance.copyTo (avoid argv ARG_MAX), probe-and-materialise resume, --env-file for secrets, idle-eviction wake.
Slice C₂: Codex parity via the agent adapter registry (agent-session-protocol's CodingAgentKind widened to claude+codex; UI kind picker; per-kind probe/capture/materialise hooks).

Conformance + cross-kind

Conformance suite (packages/coding-agents/src/conformance/integration.ts): 16 scenarios × N kinds × M targets, pluggable across providers. Catches drift between LocalDocker / Host / Sprites.
Cross-kind resume + fork: convert-kind swaps the CLI in place; fork (with from) spawns a sibling agent inheriting the parent's denormalised event history. Same-kind and cross-kind both supported.

Slice: opencode (third agent kind)

opencode-ai joins claude + codex as a first-class spawnable kind.
Curated model picker (defaults to openai/gpt-5.4-mini-fast); per-provider env-var fallback.
Cross-kind UI gated for opencode in v1 (discoverable absence; tooltip points at the deferred follow-up).

Slice: Fly Sprites provider

sprites.dev (Fly's purpose-built agentic-sandbox product) ships as a third sandbox target alongside sandbox (LocalDocker) and host.

All three coding-agent kinds work on sprites.
convert-kind works in place; fork within sprites carries conversation history.
Cross-provider transitions (sandbox/host ↔ sprites) are intentionally not supported — UI surfaces the option but disables it with an explanatory tooltip.
Round-1 implementation (early commits) was based on a doc-only recon of API rc30. Round-2 fixes (against the live rc43 server, post-merge bug-hunt) corrected:
- Exec endpoint URL (api.sprites.dev/v1/sprites/{name}/exec, not the per-sprite URL)
- Output frame multiplexing (1-byte stream-id prefix: 0x01 stdout, 0x02 stderr, 0x03 <code> exit)
- Stdin via HTTP POST (the WS stdin protocol shifted between rc30 and rc43)
- Bootstrap script (default Ubuntu image preinstalls claude/codex/gemini/node — only opencode-ai needs install, with --prefix=/usr/local)
- Env file source + export (set -a; . file; set +a)
- OAuth-token mirror (ANTHROPIC_API_KEY shaped as sk-ant-oat... mirrored to CLAUDE_CODE_OAUTH_TOKEN)
- bootstrap.ts actually wires the sprites provider when SPRITES_TOKEN is set
Full bug-by-bug record: docs/superpowers/plans/2026-05-02-coding-agents-fly-sprites.md § Implementation findings — round 2.
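
The stream-id demultiplexing listed in the round-2 fixes can be sketched roughly as follows. This is a minimal illustration, not the provider's code: `demuxFrame` and the `Frame` type are hypothetical names, and it assumes that each frame arrives as one binary message and that the exit frame carries its code in the single byte after the 0x03 prefix.

```typescript
// Hypothetical frame type; the real exec adapter may model this differently.
type Frame =
  | { stream: "stdout" | "stderr"; data: Uint8Array }
  | { stream: "exit"; code: number };

// Split one multiplexed frame into its stream id (first byte) and payload.
function demuxFrame(frame: Uint8Array): Frame {
  const id = frame[0];
  if (id === undefined) throw new Error("empty frame");
  const payload = frame.subarray(1);
  switch (id) {
    case 0x01:
      return { stream: "stdout", data: payload };
    case 0x02:
      return { stream: "stderr", data: payload };
    case 0x03: {
      // Assumption: the exit code is the first payload byte.
      const code = payload[0];
      if (code === undefined) throw new Error("exit frame without a code byte");
      return { stream: "exit", code };
    }
    default:
      throw new Error("unknown stream id " + id);
  }
}
```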

UI bug-hunt (Playwright MCP, 10 iterations)

Drove the UI end-to-end through Playwright MCP against the live dev stack. Surfaced and shipped four fixes:

F-1: cleanup:sprites script's PREFIXES missed the production coding-agent- prefix; production-spawned leaks were invisible.
F-2: Pin / Release / Stop / Convert-target / Convert-kind buttons stayed enabled on destroyed entities; gated in EntityHeader.tsx + Playwright spec.
F-3: New pnpm cleanup:volumes operator script (mirrors cleanup:sprites). Lists/deletes orphaned coding-agent-workspace-* volumes; default skips still-mounted ones.
F-4: dev.mjs lost stream registry across up-after-down because the embedded DurableStreamTestServer kept state in memory. Now sets ELECTRIC_AGENTS_STREAMS_DATA_DIR=.local/dev-streams; clear-state wipes it alongside compose volumes.

Bug-hunt report: docs/superpowers/specs/2026-05-03-bug-hunt-report.md.

Documentation

packages/coding-agents/README.md restructured around current state: TOC, quick-reference table, lifecycle/inbox-message/status reference, targets capability matrix, cleanup utilities, consolidated TLs.
docs/superpowers/specs/2026-05-02-coding-agents-fly-sprites-design.md carries a header banner pointing at round-2 findings and inlines the corrections in §1.
docs/superpowers/plans/2026-05-02-coding-agents-fly-sprites.md includes both round-1 and round-2 implementation findings.

Test plan

pnpm -C packages/coding-agents test (unit) — 139 green; new sprites tests cover stream-id demux, exit-frame fallback, name format [a-z0-9-]+, bootstrap content, wiring smoke.
DOCKER=1 LocalDocker conformance — full 16 scenarios across claude + codex + opencode.
HOST_PROVIDER=1 host-provider conformance.
SPRITES=1 SPRITES_TOKEN=… sprites-wiring.e2e.test.ts green (2.5 s) — guards bootstrap-wiring + sprite name format regressions.
pnpm -C packages/agents-server-ui exec playwright test — spawn-via-dialog.spec.ts (6 cases) + spawn-sprites.spec.ts (2 cases).
Manual UI walkthrough end-to-end: 10 bug-hunt iterations covering claude/codex/opencode × sandbox/host × volume/bindMount, convert-kind transcript carry, same-kind + cross-kind fork, pin/release/stop/kill lifecycle, sprites first-turn including bootstrap, horton.
Deferred: full fly-sprites-conformance.test.ts re-run under round-2 fixes (vitest verbose-reporter buffering issue; needs streaming-reporter or per-scenario splits).


Slice B finishes the platform-primitive migration: resume via nativeJsonl tee + cold-boot materialization, Horton tool migration to spawn_coding_agent / prompt_coding_agent, full removal of the legacy coder entity (source, tools, runtime types, UI, bootstrap), and full UI revamp (CodingAgent* components, status enum extension, header Pin/Release/Stop buttons, lifecycle row rendering). Plus a runtime-level e2e test that closes the gap which hid Slice A's slug and flat-schema bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The eager rebuild was scoped here to support state().workspace.sharedRefs accuracy after server restart, but the UI indicator consuming that field (sandbox provenance / 'shared with N' header) is also Slice C. Defer eager rebuild to land alongside its consumer; keep Slice A's lazy per-agent rebuild on first handler entry.

…ionId to sessionMeta

…ridge

…tiveSessionId per turn Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds sanitiseCwd/materialiseResume helpers and calls materialiseResume in processPrompt (after sandbox.started, before wr.acquire) so that `claude --resume <sessionId>` finds its JSONL session file inside the sandbox on every cold-boot. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…entity definition

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Slice B's plan teed claude's stream-json stdout into nativeJsonl, then materialized those lines as the resume file. That's wrong: claude --resume reads its on-disk transcript, which has a different shape than the stream-json stdout. The bridge captures the wrong data; --resume rejects the malformed file.

Reworked:
- nativeJsonl is now a single-row transcript blob (key='current', nativeSessionId, content) holding the actual on-disk transcript.
- Handler reads the transcript via base64-piped docker exec after each successful turn (captureTranscript) and writes the full blob to nativeJsonl, overwriting the previous capture.
- Cold-boot materialize reads the single row's content and pipes it back via base64.
- Bridge no longer relies on agent-session-protocol@0.0.2 to extract session_id (it reads entry.sessionId in camelCase but claude emits session_id in snake_case, so the protocol returns ''). Bridge now parses the raw stdout JSON directly to extract session_id.
- onNativeLine is no longer used by the handler. The bridge still invokes it (Task 1.1's test still passes), but the handler doesn't subscribe.
- Updated handler-resume.test.ts seeds and assertion to match the new single-row schema.

Verified end-to-end with DOCKER=1 slice-b integration test (BANANA roundtrip): turn 1 establishes "favorite fruit is BANANA"; turn 2 on a fresh sandbox correctly recalls "BANANA" from the materialized transcript.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
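
The snake_case extraction described in this commit could look roughly like the sketch below. `extractSessionId` is a hypothetical helper name, and the only assumption about the stream-json shape is the top-level `session_id` string field that the commit message mentions.

```typescript
// Hypothetical helper: pull session_id out of one raw stream-json stdout line.
function extractSessionId(line: string): string | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return undefined; // not JSON (partial line or log noise)
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  // Read the snake_case key that claude emits, not the camelCase sessionId.
  const value = (parsed as Record<string, unknown>)["session_id"];
  return typeof value === "string" && value.length > 0 ? value : undefined;
}
```

Reading only the snake_case key keeps the helper honest about what the CLI emits; a line that carries only `sessionId` yields `undefined` rather than an empty string.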

…oding_agent/prompt_coding_agent

…der.ts) and unregister from bootstrap Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Agent implementation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…and add new tool cases

…entTimeline, CodingAgentSpawnDialog; delete legacy CodingSession components Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… and Pin/Release/Stop buttons into router and sidebar

- Replace CodingSessionView/CODING_SESSION_ENTITY_TYPE with CodingAgentView/coding-agent in router.tsx
- Hoist useCodingAgent at router level; pass result as prop to CodingAgentView (single SSE connection)
- Pass db to EntityHeader for Pin/Release/Stop inbox-dispatch buttons
- Replace CodingSessionSpawnDialog with CodingAgentSpawnDialog in Sidebar; remove all legacy CODING_SESSION_ENTITY_TYPE refs
- Add @electric-ax/coding-agents workspace dep; fix Streamdown content->children prop
- UI typecheck clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents the migration completion: resume + Horton + legacy removal + UI revamp. Captures the major mid-flight pivot (resume mechanism changed from per-line stream-json tee to single-row on-disk transcript capture) and the upstream agent-session-protocol bug (reads sessionId camelCase but claude emits session_id snake_case). All 32 coding-agents tests + 388 runtime + 44 agents pass green with DOCKER=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… spawn-dialog initialPrompt through initialMessage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ge build path

Three real bugs surfaced when the agents-server-ui dist was rebuilt post-Slice-B and the manual smoke test exercised an entity in the browser:

1. router.tsx called useCodingAgent AFTER an early-return for missing selectedEntity → React error #310 ("rendered more hooks than during previous render"). Moved the hook call BEFORE the early return.
2. CodingAgentTimeline.tsx read tool_call payloads as `{toolName, args}`, but agent-session-protocol's ToolCallEvent uses `{tool, input}`. Tool calls rendered as generic "tool" with no name. Fixed to match the protocol's actual shape.
3. useCodingAgent.ts imported wire constants from @electric-ax/coding-agents. That package transitively pulls in node-only deps (LocalDockerProvider, StdioBridge, agent-session-protocol's randomUUID import) that vite can't externalize for the browser bundle. Hardcoded the four wire strings locally to break the import chain.

Plus two infra changes that enable iterating on the agents-server image locally:

- packages/agents-server/Dockerfile: copy `packages/coding-agents/package.json` and add a build step for the `coding-agents` workspace package (it didn't exist when this Dockerfile was written; agents now depends on it).
- packages/electric-ax/docker-compose.full.yml: change agents-server pull_policy from `always` to `missing` so a locally-tagged `electricax/agents-server:local` image is honoured by the quickstart instead of failing with a registry pull error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…es + electric in docker) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…quests The /send endpoint requires a 'from' identifier (used to tag inbox provenance). MessageInput sends 'user'; the new Pin/Release/Stop buttons forgot it and got HTTP 400 'Missing required field: from'. Added from='user' to all three. Verified with a curl POST that the endpoint now returns 204 No Content.

… on cold boot Previously, processPrompt unconditionally inserted sandbox.starting + sandbox.started lifecycle rows on every prompt. Since lm.ensureRunning is idempotent (returns the existing sandbox when one is running), warm prompts produced misleading 'Sandbox starting' entries between every turn in the UI timeline — the user's inference that the sandbox was restarting was understandable but wrong. Now both rows fire only when the prior status was 'cold'. Materialise of nativeJsonl is also gated on cold-boot (no need to re-write the file when the existing container already has it). Capture-transcript remains unconditional (we want a fresh snapshot after each turn). A small backfill path keeps sessionMeta.instanceId fresh on warm prompts when it was unset for some reason.

…ke_case) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… event types in CodingAgentTimeline Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… catch

Phase 1 Task 1 from docs/superpowers/plans/2026-05-03-coding-agents-post-review-followups.md was driven by R2 #5: "chain leak under concurrent acquirers; first releaser's chain-pointer check never matches, entries grow unbounded".

Empirical investigation: the bug doesn't manifest. Two new tests (N=8 concurrent acquire→release tasks; non-overlap + completion proof) confirm chainByIdentity drains to 0 after all releases and no acquirer is dropped. The reason: the chain-delete branch is gated by `remaining === 0` (last releaser), and at that point the chain pointer is necessarily the last `link` set (no new acquirer present). The two conditions are coupled, not racy.

Keeping the tests as a forward-looking regression catch: any future refactor that introduces the failure mode the reviewer described will fail loudly. Plan task is closed as "not a bug, regression test added".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1 Task 2; drives Phase 0 Task 0.2 (L2.10 conformance) green. Before this change: an agent that ended a prior turn in status='error' went straight to status='running' on the next prompt. wasCold was false (only checks 'cold'), so no sandbox.starting lifecycle row was emitted; the stale lastError stayed visible through completion. State-machine paper claims error→cold→starting→running; reality was error→running. After: at top of processPrompt (after cancelIdleTimer), if status is 'error', flip to 'cold' and clear lastError so the cold-boot path runs normally. Idempotent on every status that isn't 'error'. Verified: L2.10 conformance passes 3/3 kinds on LocalDocker (35s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
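
A minimal sketch of the error→cold normalisation described above, assuming a simplified `SessionMeta` with just `status` and `lastError` (a `null` value standing for "cleared"). `prepareForPrompt` is a hypothetical name; the real handler does this inline in processPrompt.

```typescript
// Simplified stand-in for the entity's session metadata (assumed shape).
interface SessionMeta {
  status: string;
  lastError: string | null;
}

// If the previous turn ended in 'error', reset to 'cold' and clear lastError
// so the normal cold-boot path (and its lifecycle rows) runs. Any other
// status passes through unchanged, which makes the step idempotent.
function prepareForPrompt(meta: SessionMeta): { meta: SessionMeta; wasCold: boolean } {
  const next: SessionMeta =
    meta.status === "error" ? { status: "cold", lastError: null } : meta;
  return { meta: next, wasCold: next.status === "cold" };
}
```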

Phase 1 Task 3; drives Phase 0 Task 0.5 (L2.13 conformance) green. Before this change: fork copied source events + nativeJsonl unconditionally. If the source was running/starting/stopping, events were still streaming when the fork's first-wake observed them; convertNativeJsonl produced a transcript ending mid-message and the fork's first --resume would corrupt state. After: at the top of the fork branch (right after the events collection check), if sourceMeta.status is running/starting/stopping we emit kind.convert_failed with detail "fork rejected: source not quiescent (status=...)", set lastError + status='error' on the fork, and return. The user can re-fork once the source quiesces. Verified: L2.13 conformance passes 3/3 kinds on LocalDocker; L2.7, L2.8, L2.9, L2.10, L2.11, L2.12 also still pass (21/21 in scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
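
The quiescence guard can be pictured as the sketch below. `checkForkSource` and `ForkDecision` are hypothetical names used for illustration; the in-flight statuses, the `kind.convert_failed` event name and the detail string come from the commit message.

```typescript
// Result of the guard: either proceed, or reject with a lifecycle event.
type ForkDecision =
  | { ok: true }
  | { ok: false; event: "kind.convert_failed"; detail: string };

// A source is in flight while it is running, starting or stopping.
function isInFlight(status: string): boolean {
  return status === "running" || status === "starting" || status === "stopping";
}

// Reject the fork when the source may still be streaming events.
function checkForkSource(sourceStatus: string): ForkDecision {
  if (isInFlight(sourceStatus)) {
    return {
      ok: false,
      event: "kind.convert_failed",
      detail: "fork rejected: source not quiescent (status=" + sourceStatus + ")",
    };
  }
  return { ok: true };
}
```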

Phase 1 Task 4 from R2 #8. L2.4 used to only assert runs[last].status === 'completed'. A provider that returned 'running' for a stale agent would reach reconcile's `isOrphaned` branch and transition status to `idle` — a different end-state than `cold` (the path for stopped/unknown). The original test passed for either outcome. Adding `expect(finalMeta.status).toMatch(/^(cold|idle)$/)` locks both branches into the contract — anything else (e.g. status stays `running` because reconcile mis-classified) fails loudly. Verified: still 3/3 on LocalDocker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mpotent close

Phase 2 Task 5 from R1 #4. execWithStdinViaPost's writeStdin and closeStdin had two defects:
- writeStdin appended to stdinBuf with no guard. After closeStdin fired the actual POST, any further write was silently buffered and lost (the data was never sent because the request had already flushed).
- closeStdin re-fired `void start()` on every call. start() itself was idempotent via its `started` flag, but the contract is cleaner if closeStdin is itself a no-op the second time.

Fix: a `closed` flag. writeStdin throws if called after close; closeStdin is a no-op the second time.

Defence-in-depth: StdioBridge calls writeStdin then closeStdin sequentially per turn, so the new throw can't fire under normal usage. The guard catches future caller misuse loudly instead of silently swallowing data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
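
The `closed` flag pattern can be sketched in isolation as follows. `BufferedStdin` is a hypothetical class, and the injected `send` callback stands in for the HTTP POST that the real exec adapter fires.

```typescript
// Buffers stdin writes and flushes them exactly once on close.
class BufferedStdin {
  private chunks: string[] = [];
  private closed = false;
  private readonly send: (body: string) => void;

  constructor(send: (body: string) => void) {
    this.send = send; // stand-in for the POST request (assumption)
  }

  writeStdin(chunk: string): void {
    // Writing after the flush would silently lose data, so fail loudly.
    if (this.closed) throw new Error("writeStdin called after closeStdin");
    this.chunks.push(chunk);
  }

  closeStdin(): void {
    if (this.closed) return; // second close is a no-op
    this.closed = true;
    this.send(this.chunks.join(""));
  }
}
```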

Phase 2 Task 6 from R1 #5. Sprites' exec API treats per-call env in a way that's both unstable (rc30→rc43 protocol shift) and inconsistent (silently ignored when the cmd is shell-wrapped). Today's WS / POST-stdin paths both forwarded `req.env` via repeated `?env=KEY=VALUE&` query params; for the conformance harness's claude-on-sprites tests the per-call ANTHROPIC_API_KEY mirror reached the child only because the env file already had it. Fix: stage per-call env inside the wrapper script. wrapWithAgentEnv now takes an optional `env` arg and emits an `export KEY=value` line per pair, after sourcing /run/agent.env (so per-call values override the file's defaults). Env-key validation rejects shell-unsafe names to defend against caller-controlled injection. The WS and POST exec URLs no longer carry env query params at all. Verified: L1.5 (exec honours cwd and env) passes on the live sprites API; sprites unit tests 18/18 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
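
A rough sketch of staging per-call env inside the wrapper script. The function body is an assumption based on the description above: the exact key pattern and the quoting helper are illustrative, while the `set -a; . file; set +a` sourcing and the `/run/agent.env` path come from the PR text.

```typescript
// Assumed pattern for shell-safe variable names.
const ENV_KEY = /^[A-Za-z_][A-Za-z0-9_]*$/;

// POSIX single-quote quoting: close the quote, emit \', reopen.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Build a script that sources the env file, then applies per-call overrides.
function wrapWithAgentEnv(cmd: string, env: Record<string, string> = {}): string {
  const lines = ["set -a; . /run/agent.env; set +a"];
  for (const key of Object.keys(env)) {
    if (!ENV_KEY.test(key)) throw new Error("unsafe env key: " + key);
    // Exports come after the file is sourced, so they override its defaults.
    lines.push("export " + key + "=" + shellQuote(env[key] ?? ""));
  }
  lines.push(cmd);
  return lines.join("\n");
}
```

Because the values travel inside the script body, nothing depends on how the exec endpoint treats env query parameters.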

Phase 2 Task 7 from R1 #7. The Sprites API paginates with `has_more` + `next_continuation_token`, but the SpritesApiClient only fetched a single page. A sprite buried past the first page would be silently missed by `findExisting`, causing the next `createSprite` for the same agentId to 409 with "name already exists". Same risk in the cleanup-sprites operator script and the new conformance afterAll. Adds `listAllSprites(opts)` that loops until `!has_more` (or runs out of `next_continuation_token`), with a 50-page hard cap as a defensive guard. Updates `findExisting`, `cleanup-sprites.ts`, and the conformance afterAll to use it. Sprites unit tests 18/18 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
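
The pagination loop might look like the sketch below. The `has_more` and `next_continuation_token` fields and the 50-page cap come from the commit message; the `sprites` array field, the `FetchPage` callback and the overall signature are assumptions for illustration.

```typescript
// Assumed page shape; only has_more and next_continuation_token are from the PR text.
interface SpritePage {
  sprites: { name: string }[];
  has_more: boolean;
  next_continuation_token?: string;
}

// Hypothetical page fetcher; in practice this would wrap the HTTP call.
type FetchPage = (token?: string) => Promise<SpritePage>;

const MAX_PAGES = 50; // defensive hard cap

async function listAllSprites(fetchPage: FetchPage): Promise<{ name: string }[]> {
  const all: { name: string }[] = [];
  let token: string | undefined;
  for (let page = 0; page < MAX_PAGES; page++) {
    const res = await fetchPage(token);
    all.push(...res.sprites);
    // Stop when the server says there is no more, or gives no token to continue with.
    if (!res.has_more || !res.next_continuation_token) break;
    token = res.next_continuation_token;
  }
  return all;
}
```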

Phase 2 Task 8; drives Phase 0 Task 0.7 (L1.11 conformance) green on Host. Before: HostProvider.stop() was a no-op. The SandboxProvider contract (which L1.11 codifies) says stop() must terminate the running child within N seconds; calling stop() while a turn was mid-exec on the host left the child running unsupervised. LocalDocker met this by implication (container removal kills the process); sprites by the WS close path; host did not. After: AgentRecord tracks `activeChildren: Set<ChildProcess>`. Each spawn registers; child `exit`/`error` unregisters. stop() and destroy() now SIGTERM every active child for the matching agent and fall back to SIGKILL after a 5 s grace period via a shared terminateChildren helper. Verified: L1.10 + L1.11 both pass on Host (LocalDocker and sprites conformance unaffected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
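
The SIGTERM-then-SIGKILL fallback can be sketched against a minimal child interface. `ChildLike` is a hypothetical structural type standing in for Node's ChildProcess, and this version of `terminateChildren` is an illustration of the pattern rather than the helper that shipped.

```typescript
// Minimal surface this sketch needs from a child process (assumed).
interface ChildLike {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
  once(event: "exit", listener: () => void): unknown;
}

// Ask every child to terminate; force-kill any that outlives the grace period.
function terminateChildren(children: ReadonlyArray<ChildLike>, graceMs: number = 5000): void {
  for (const child of children) {
    const timer = setTimeout(() => {
      child.kill("SIGKILL");
    }, graceMs);
    // Register the exit listener before signalling so a fast exit cancels the fallback.
    child.once("exit", () => clearTimeout(timer));
    child.kill("SIGTERM");
  }
}
```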

The file's own comment said 'delete in a follow-up once the conformance suite has shipped for one release cycle' — that's now true. Slice-A lifecycle scenarios are exercised by local-docker-conformance.test.ts via runCodingAgentsIntegrationConformance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…euristic

Phase 3 Task 11.
- sessionId now must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/ (rejects leading dashes like '-rf'). Defence-in-depth: adapters shellQuote but the CLI boundary should reject obvious shell metacharacters even before the entity boundary.
- isMain replaces `endsWith('import.js')` with `path.basename(...) === 'import.js'` so a consumer file with that suffix doesn't accidentally activate this CLI body when imported.

Test added for the leading-dash rejection. cli-import unit tests 10/10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 3 Task 10.
- model now matches /^[A-Za-z0-9._/:-]+$/. Without this, a model like `gpt-4";evil="x` would close the value quote in the `-c model="..."` flag and inject an arbitrary config key. Adapters are reachable from import CLI / spawn args paths and shouldn't trust caller-supplied input.
- sessionId now matches /^[A-Za-z0-9-]+$/ (UUID shape) in buildCliInvocation, probeCommand, and captureCommand. Without this, a sessionId containing `*` or `?` would broaden the find glob silently and could match an unrelated transcript.

Both validations throw early with a clear error rather than silently producing wrong argv.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
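
The two validations amount to something like the sketch below. The regexes are the ones quoted in the commit message; the `assertModel` / `assertSessionId` names and error texts are hypothetical.

```typescript
// Patterns quoted in the commit message.
const MODEL_RE = /^[A-Za-z0-9._\/:-]+$/;
const SESSION_ID_RE = /^[A-Za-z0-9-]+$/;

// Throw early instead of building an argv with attacker-shaped input.
function assertModel(model: string): string {
  if (!MODEL_RE.test(model)) throw new Error("invalid model: " + JSON.stringify(model));
  return model;
}

function assertSessionId(sessionId: string): string {
  if (!SESSION_ID_RE.test(sessionId)) {
    throw new Error("invalid sessionId: " + JSON.stringify(sessionId));
  }
  return sessionId;
}
```

For example, `assertModel('gpt-4";evil="x')` throws because `"`, `;` and `=` fall outside the allowed set, while `assertModel("openai/gpt-5.4-mini-fast")` passes through unchanged.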

Phase 3 Task 9. The handler races bridge.runTurn against runTimeoutMs via raceTimeout. Before this change, the loser's child process kept running — raceTimeout rejects the promise, but the orphan CLI stays attached to its sandbox until the sandbox itself tears down. Adds optional `signal?: AbortSignal` to RunTurnArgs. The bridge forwards an abort to handle.kill('SIGTERM') and rejects with a clear 'aborted (signal)' error if the abort fires before normal exit. The listener is removed on completion to avoid leaking the closure. The handler creates an AbortController with a setTimeout matching runTimeoutMs and passes its signal alongside the existing raceTimeout. The two coexist: raceTimeout owns the promise-level rejection; the signal owns the child-reaping. clearTimeout in finally avoids the timer firing post-completion. All unit tests pass (151/159; the 2 failing are pre-existing handler-resume flakes unrelated to this change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 3 Task 19: replaces a fixed setTimeout(2500) with a poll loop in slice-b.test.ts. R5 flagged the timing as flaky on slow CI — the idle timer fires at 1500 ms but the visible 'stopped'/'unknown' status transition includes container teardown which can push past the fixed wait. New helper polls every 100 ms with a 30 s ceiling. Bonus: typecheck-fix in conformance/integration.ts where the L2.13 scenario referenced a non-existent 'fork.failed' lifecycle event. The schema's enum uses 'kind.convert_failed' for both kind-converts and forks (see entity/collections.ts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 4 Tasks 12 + 13.

T12: CodingAgentSpawnDialog.canSubmit now enforces the target ⇄ workspaceMode invariants explicitly:
- target='host' requires workspaceMode='bindMount'
- target='sprites' requires workspaceMode='volume'

The button-click handlers force the right workspace when toggling target, but a strict-mode double-render or future refactor could submit a stale combo. Belt-and-braces.

T13: EntityHeader Convert-target and Convert-kind dropdowns now disable when status is 'error', not just running/starting/stopping or destroyed. Converting from a failed prior turn risks acting on stale lastError state; the user should retry to clear the error first. 'cold' stays allowed (it's a state-mutation only; no sandbox is up).

Both files typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
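
The T12 invariant reduces to a small predicate, sketched here. `isValidCombo` is a hypothetical name, and treating target='sandbox' as accepting both workspace modes is an assumption (the test plan exercises sandbox with both volume and bindMount).

```typescript
type Target = "sandbox" | "host" | "sprites";
type WorkspaceMode = "volume" | "bindMount";

// True when the spawn dialog may submit this target/workspace pair.
function isValidCombo(target: Target, mode: WorkspaceMode): boolean {
  if (target === "host") return mode === "bindMount";
  if (target === "sprites") return mode === "volume";
  return true; // assumption: the sandbox target accepts either mode
}
```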

Property: for any input string s and any partition s = p1+p2+...+pN, feeding the partitions sequentially into StreamQueue and calling end() produces a line sequence identical to feeding s whole. Catches future regressions of the C2 line-tail-buffer fix in providers/fly-sprites/exec-adapter.ts. 7 canonical inputs × 200 deterministic seeds = 1400 partitions checked. Sub-second runtime. StreamQueue is now exported from exec-adapter.ts so the test can import it directly without going through the full ExecHandle factory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
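
To make the property concrete, here is a self-contained sketch of the same idea. `LineBuffer` is a simplified stand-in for StreamQueue (not its real implementation), and the seeded generator and helper names are hypothetical.

```typescript
// Simplified stand-in for StreamQueue: buffers the partial last line across chunks.
class LineBuffer {
  private tail = "";
  private lines: string[] = [];

  push(chunk: string): void {
    const parts = (this.tail + chunk).split("\n");
    this.tail = parts.pop() ?? ""; // text after the last newline stays buffered
    this.lines.push(...parts);
  }

  end(): string[] {
    if (this.tail !== "") {
      this.lines.push(this.tail);
      this.tail = "";
    }
    return this.lines;
  }
}

// Deterministic linear congruential generator so every seed is reproducible.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Cut s into consecutive non-empty pieces of pseudo-random length.
function randomPartition(s: string, rng: () => number): string[] {
  const parts: string[] = [];
  let i = 0;
  while (i < s.length) {
    const len = 1 + Math.floor(rng() * (s.length - i));
    parts.push(s.slice(i, i + len));
    i += len;
  }
  return parts;
}

function linesOf(chunks: string[]): string[] {
  const buf = new LineBuffer();
  for (const chunk of chunks) buf.push(chunk);
  return buf.end();
}

// The property: every partition yields the same lines as the whole input.
function partitionInvariantHolds(s: string, seeds: number): boolean {
  const expected = JSON.stringify(linesOf([s]));
  for (let seed = 0; seed < seeds; seed++) {
    const actual = JSON.stringify(linesOf(randomPartition(s, makeRng(seed))));
    if (actual !== expected) return false;
  }
  return true;
}
```

Calling `partitionInvariantHolds(input, 200)` for each canonical input mirrors the inputs × seeds grid described above.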

Property: at the 900_000-byte cap, accept LIMIT-1 and LIMIT, reject LIMIT+1, and reject any multibyte string whose UTF-8 byte length overflows even when its UTF-16 code-unit length doesn't. Catches future regressions of the C5 fix in StdioBridge that replaced prompt.length with Buffer.byteLength(prompt, 'utf8'). One test specifically pins the multibyte boundary case where the old code would have silently accepted a too-long prompt; another asserts the error message reports byte count (not chars) so users can correlate with the limit. 5 cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
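
A sketch of the byte-length check this property pins down. `assertPromptFits` is a hypothetical name and the error wording is illustrative; the sketch uses `TextEncoder` to count UTF-8 bytes, which gives the same number as the `Buffer.byteLength(prompt, 'utf8')` call named above.

```typescript
const PROMPT_LIMIT_BYTES = 900_000;

// Measure the prompt in UTF-8 bytes, not UTF-16 code units.
function assertPromptFits(prompt: string): void {
  const bytes = new TextEncoder().encode(prompt).length;
  if (bytes > PROMPT_LIMIT_BYTES) {
    throw new Error(
      "prompt too large: " + bytes + " bytes (limit " + PROMPT_LIMIT_BYTES + " bytes)"
    );
  }
}
```

A string of 450,001 copies of "é" has a `.length` well under the cap but encodes to 900,002 bytes, which is the multibyte boundary case where a `prompt.length` check would wrongly accept.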

Property: every adapter's buildCliInvocation produces byte-stable argv across (kind × {prompt-only, with-model, with-session, with-both}) input shapes. Would have caught the opencode --print-logs accident at compile-time of the test suite, not at L2.1 runtime. Catches future drift in claude/codex/opencode argv shape — reviewer must explicitly approve any change via 'pnpm test -u'. 10 snapshots locked. opencode prompt-only / session-only shapes are skipped (the adapter requires a model). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Property: every adapter's probeCommand / captureCommand / postMaterialiseCommand must treat sessionId as data, not code. For each (adapter × adversarial input) pair, the resulting argv must either (a) not pass through sh -c (direct argv invocation is safe by construction), or (b) embed the adversarial input fully inside one sh-tokenized word's content, or (c) reject the input by throwing during the build (validation-throw is equivalent to safe). Adversarial corpus: 9 strings covering single-quote close-and-reopen, command substitution sigils, redirect operators, glob metacharacters, spaces, backslash, and the textbook '\'' escape attempt. Generalises the C6 fix (opencode shellQuote) — any future adapter that forgets to shell-quote a caller-controlled field will fail this test loudly. Tokenizer handles single-quote, double-quote (with the correct sh-style backslash escape rules: only $, \`, \", \\, \n are escapes inside double quotes), and outside-of-quotes backslash-escape. 7 cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Property: for every (initial status × control-plane message) pair, the resulting (final status, lastError-presence, lifecycle-events) is byte-stable against a checked-in snapshot. Catches state-machine drift over time. Any future handler change that alters a cell forces an explicit `pnpm test -u` and reviewer approval — the diff makes the new transition visible. Coverage: 7 statuses × 6 control-plane messages = 42 cells. The `prompt` message is excluded; it requires non-trivial bridge mocking and is covered end-to-end by the L2 conformance suite. Stub provider returns status='unknown' so reconcile orphan branches fire when initial status is 'running'. Stub bridge throws if called. Each cell isolates with a fresh handler+lm+wr triple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 4 Task 14. PUT /coding-agent/<name> returns once the entity row is durable, but the handler registration that consumes our 'pin' payload lands on a separate path. Several specs raced spawn → pin → goto → assert and saw an empty timeline because the pin landed before init wiring was complete. The race wasn't reliably reproducible in this session but matches the R4 review finding. waitForEntityReady polls GET /coding-agent/<name> for up to 5 s before sending the pin. When the entity is registered and discoverable, fire the pin as before. Verified: full Playwright suite still passes 31/31 (1 skipped) on the running dev server. No spec changes needed — helper is internal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The race the reviewer described wasn't reproducible in this session; the Playwright suite was 31/31 green both with and without the poll. The 100 ms × N poll overhead doesn't earn its keep. If a flake actually surfaces, re-add a targeted poll. Speculative coverage isn't worth the runtime cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

L2.9 codified a chain-leak property in WorkspaceRegistry that I empirically verified does NOT manifest (R2 #5 was a false positive, documented in the regression-catch unit tests). The 3-way concurrent test passed today by accident of correctness, not because it caught a bug. L2.6 (2-way non-overlap with the lease serialising concurrent runs) covers the load-bearing property. The unit-level workspace-registry tests added in commit 2021c5d cover the chain-pointer cleanup invariant directly without spinning up real sandboxes. Net: −68 lines, no coverage regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The two-timer redundancy (raceTimeout + AbortController firing in parallel) was a code smell. The orphan-child cleanup is the sandbox layer's responsibility:
- LocalDocker: container removal kills the child on next destroy().
- HostProvider: T8 added activeChildren tracking; stop()/destroy() SIGTERM with SIGKILL fallback.
- FlySpriteProvider: WS close terminates the exec.

Without T9, the bridge has no concept of timeouts, and that's correct layering. raceTimeout in the handler rejects the runTurn promise; the sandbox's next teardown reaps the child. The 'instant SIGTERM on timeout' behaviour T9 added is nice-to-have but not load-bearing given the sandbox-level guarantees.

Net: −41 lines (types + bridge + handler). 27/29 unit tests pass (the 2 failures are pre-existing handler-resume flakes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DRY pass on the post-review additions: shellQuote was duplicated in claude.ts, codex.ts, opencode.ts (3 copies of the same 3-line function). Extracted to agents/shell-quote.ts so a future fix to the quoting algorithm lands in one place. isInFlight was duplicated in processStop, processConvertTarget, processConvertKind, and the fork-source quiescence guard (4 copies of the same status === running || starting || stopping check). Extracted to a top-level helper in handler.ts. Also makes the intent more self-documenting at call sites. Net: −20 lines, single source of truth for two cross-cutting concepts. Verified: all unit tests pass (163/171, with 2 pre-existing handler-resume flakes unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Thin wrapper skill for the electric-ax-import CLI: detects the current workspace + session id from the active Claude Code session, then runs the import against a running electric-agents server. The session shows up as a coding-agent entity in the UI (observable / forkable). Lives at packages/coding-agents/claude-skills/electric-import/SKILL.md; install with cp -R into ~/.claude/skills/. README.md added to both the package root and claude-skills/ documents the install + the trigger phrases. A note in the package README clarifies the supported scope: importing makes the session observable, NOT injectable. Claude Code has no third-party API for pushing user-messages into a running interactive session — see the research summary in the May 2026 session notes. Codex / opencode equivalents documented as out-of-scope-for-now in the skill's own 'Out of scope' section. Verified locally: `cp -R` into ~/.claude/skills/ and the skill registers in the available-skills list at session start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reported: imported agent on target='host' showed 'Sandbox starting' in the timeline, even though no sandbox is actually booting. The host provider does a quick stat() on the workspace and is otherwise an attach. The misleading lifecycle event fires every time the agent warms back up from cold (idle eviction → next prompt).

Fix: suppress the sandbox.starting / sandbox.started lifecycle inserts when target='host'. Status transitions through 'starting' are preserved (state-machine consistency). sandbox.failed stays: host attach can still fail (workspace not a directory) and the failure is meaningful.

L2.2 conformance still passes ('warm second prompt' asserts not.toContain('sandbox.starting'); the new behaviour is a strict superset).

Persisted lifecycle rows on existing agents aren't retroactively cleaned. The fix only affects future cold-boots after the dev server picks up the rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
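The gating rule above can be sketched as a small predicate. This is an illustration only: `shouldRecordLifecycle` and `coldBootLifecycle` are hypothetical names, not functions from handler.ts, and the real cold-boot block does more than emit lifecycle rows. The target names and event names are the ones used in the commit message.

```typescript
type Target = "sandbox" | "host" | "sprites";
type LifecycleEvent = "sandbox.starting" | "sandbox.started" | "sandbox.failed";

// Hypothetical helper: should this lifecycle row be inserted for this target?
function shouldRecordLifecycle(target: Target, event: LifecycleEvent): boolean {
  if (target !== "host") return true;
  // Host attach boots no sandbox, so only a failed attach is worth recording.
  return event === "sandbox.failed";
}

// Hypothetical simulation of the rows a cold boot would leave behind.
function coldBootLifecycle(target: Target, attachOk: boolean): LifecycleEvent[] {
  const candidates: LifecycleEvent[] = attachOk
    ? ["sandbox.starting", "sandbox.started"]
    : ["sandbox.starting", "sandbox.failed"];
  return candidates.filter((event) => shouldRecordLifecycle(target, event));
}
```

Under these assumptions a successful host cold boot records no sandbox.* rows, a failed host attach records only sandbox.failed, and the other targets are unaffected.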

…ks render line-per-line

Two bugs surfaced by inspecting an imported claude→codex fork.

1. Tool calls dropped (coding-agents): codex exec --json emits shell invocations as {type:'item.completed', item:{type:'command_execution', command, aggregated_output, exit_code}}. The patched agent-session-protocol@0.0.2 only handles function_call / function_call_output items, so command_execution is silently dropped. Result: every shell call codex made was invisible in the timeline.

   Fix: a small pre-pass (codex-command-shim.ts) expands each command_execution item into a function_call + function_call_output pair on the wire so asp's existing matchers fire. Order is preserved (call before output, both share the item id so asp pairs them). Cheap and self-contained, with no upstream patch maintenance.

2. Assistant code-block lines rendered as one mashed string (UI): Streamdown wraps each source line of a fenced code block in a <span class='block ...'>. styles.css already has a rule that forces those spans to display:block, but the rule is scoped to .agent-ui-markdown and AssistantMessageRow forgot the className. Result: spans stayed display:inline and 69 lines of a tree listing rendered as one mashed string.

   Fix: add className='agent-ui-markdown' to AssistantMessageRow's wrapper. Mirrors AgentResponse.tsx (Horton's renderer, which already had it).

Verified: typecheck clean, 163/171 unit tests pass (2 pre-existing handler-resume flakes). Send a fresh codex prompt to see both fixes land; existing events rows aren't retroactively rewritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
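The pre-pass from bug 1 could look roughly like the sketch below. It is not the contents of codex-command-shim.ts: the field names on the synthesised items (`name`, `arguments`, `output`) and the tool name `"shell"` are assumptions, since the commit message only states that each command_execution item becomes a function_call + function_call_output pair sharing the item id.

```typescript
interface CodexItem {
  id: string;
  type: string;
  [key: string]: unknown;
}

interface CodexEvent {
  type: string;
  item?: CodexItem;
}

// Expand one command_execution event into a call/output pair; pass
// everything else through untouched.
function expandCommandExecution(event: CodexEvent): CodexEvent[] {
  const item = event.item;
  if (event.type !== "item.completed" || !item || item.type !== "command_execution") {
    return [event];
  }
  const call: CodexEvent = {
    type: "item.completed",
    item: {
      id: item.id, // shared id lets the consumer pair call and output
      type: "function_call",
      name: "shell", // assumed tool name
      arguments: JSON.stringify({ command: item["command"] }),
    },
  };
  const output: CodexEvent = {
    type: "item.completed",
    item: {
      id: item.id,
      type: "function_call_output",
      output: JSON.stringify({
        output: item["aggregated_output"],
        exit_code: item["exit_code"],
      }),
    },
  };
  return [call, output]; // call first, then output
}

// Line-oriented wrapper for a JSONL stream; unparseable or unaffected
// lines are returned exactly as received.
function shimLine(line: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return [line];
  }
  if (typeof parsed !== "object" || parsed === null) return [line];
  const expanded = expandCommandExecution(parsed as CodexEvent);
  return expanded.length === 1 ? [line] : expanded.map((e) => JSON.stringify(e));
}
```

Running the shim before the existing parser keeps the fix local: the downstream matchers see only item types they already understand.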

Claude prefers ANTHROPIC_API_KEY over CLAUDE_CODE_OAUTH_TOKEN when both are present, treating the value verbatim as a plain API key. When the host's ANTHROPIC_API_KEY actually contains an OAuth subscription token (`sk-ant-oat...`, common when the dev shell inherited it from claude.ai's keychain bridge), the previous register.ts mirrored it into CLAUDE_CODE_OAUTH_TOKEN AND left ANTHROPIC_API_KEY in the forwarded env. Inside the sandbox (which has no keychain fallback), claude picks ANTHROPIC_API_KEY, hits the API as a plain key, and every turn fails with "Invalid API key" -> exit 1, stderr empty (the JSON error lands on stdout).

Symptom in stream: `cli-exit:claude CLI exited 1. stderr=<empty>` for both bindMount and volume workspace types. workspaceType was a red herring; the failure is auth-shape-only.

Fix: when ANTHROPIC_API_KEY starts with `sk-ant-oat`, promote to CLAUDE_CODE_OAUTH_TOKEN and delete ANTHROPIC_API_KEY before forwarding. Also extract the supplier as `defaultEnvSupplier` so it can be tested directly with an injected env source.

Verified: spawning a fresh claude/sandbox/bindMount agent and a fresh claude/sandbox/volume agent both complete the turn successfully ("OK" assistant text), where on `main` they both fail identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
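A sketch of the promotion rule, assuming a supplier that takes its env source as an argument. The signature, the `EnvSource` type, and the two-key forwarding list are assumptions for illustration; the real `defaultEnvSupplier` in register.ts is not shown in this PR text. Keeping an already-set CLAUDE_CODE_OAUTH_TOKEN rather than overwriting it is also an assumption.

```typescript
type EnvSource = Record<string, string | undefined>;

// Assumed forwarding list; the real supplier may forward more variables.
const FORWARDED_KEYS = ["ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"] as const;

function defaultEnvSupplier(source: EnvSource): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of FORWARDED_KEYS) {
    const value = source[key];
    if (value) env[key] = value;
  }

  const apiKey = env["ANTHROPIC_API_KEY"];
  if (apiKey !== undefined && apiKey.startsWith("sk-ant-oat")) {
    // The value is an OAuth token, not a plain API key: promote it and
    // drop ANTHROPIC_API_KEY so the CLI cannot pick it up as a plain key.
    if (!env["CLAUDE_CODE_OAUTH_TOKEN"]) {
      env["CLAUDE_CODE_OAUTH_TOKEN"] = apiKey;
    }
    delete env["ANTHROPIC_API_KEY"];
  }
  return env;
}
```

Because the source is injected, a unit test can pass a literal object instead of mutating the process environment.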

…arget=host

Adds regression coverage for commit b26d41e. The fix in handler.ts' processPrompt cold-boot block (skipping sandbox.starting/sandbox.started lifecycle inserts when meta.target === 'host') was correct, but had no unit test pinning it.

Reproduced live: a freshly spawned host-target coding-agent on the running dev server still showed 'Sandbox starting' in its timeline because the running start-builtin process predated the dist rebuild and was holding the pre-fix module in memory. After restarting the handler, host agents emit zero sandbox.* lifecycle rows end-to-end (verified: PUT spawn → POST prompt → curl /main → empty lifecycle collection).

Two unit tests added in entity-handler.test.ts:

1. cold → starting → idle on host: status transitions through 'starting' for state-machine consistency, but neither sandbox.starting nor sandbox.started ends up in the lifecycle collection.
2. error → cold → starting → idle on host (re-prompt after a prior CLI exit): the error fall-through resets to 'cold' and re-runs the cold-boot block, so the host suppression must hold there too.

Both tests fail with the host gates removed (verified locally) and pass with them in place. sandbox.failed is intentionally untouched: host attach can fail (e.g. workspace not a directory) and that's meaningful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…or sprites

Surfaced by R2's investigation of agent FYuEorn_F7. Codex inside our Docker sandbox container can't run any shell command on macOS Docker Desktop: codex's inner bwrap-based command sandbox fails with 'bwrap: No permissions to create a new namespace, likely because the kernel does not allow non-privileged user namespaces.' Result: every shell tool call dies, codex silently produces no useful output, and the user sees an agent that does nothing.

Codex 0.128 ships --dangerously-bypass-approvals-and-sandbox, documented as 'intended solely for running in environments that are externally sandboxed.' That's exactly target=sandbox (Docker container) and target=sprites (sprite is the workspace and the isolation boundary). For target=host we leave codex's normal sandbox active: there is no outer isolation, so codex's bwrap layer is the only one.

Threaded `target` through: RunTurnArgs -> stdio-bridge -> CodingAgentAdapter.buildCliInvocation. The claude / opencode adapters ignore the new field; codex uses it. The existing argv-stability snapshot doesn't change (target wasn't part of the input shapes covered).

Tests: 4 new direct assertions in adapter-argv.test.ts covering sandbox / sprites / host / undefined target. 16/16 pass on the contract suite. Full unit run: 171 passed (the 2 failing handler-resume tests are pre-existing and unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
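The per-target decision can be illustrated with a simplified argv builder. `buildCodexArgv` is a hypothetical stand-in for the codex adapter's buildCliInvocation, whose real signature and full argument list are not shown here; treating an undefined target like host (no bypass flag) is an assumption about the safe default.

```typescript
type SandboxTarget = "sandbox" | "host" | "sprites";

// Hypothetical, simplified argv builder for `codex exec --json`.
function buildCodexArgv(prompt: string, target?: SandboxTarget): string[] {
  const argv = ["codex", "exec", "--json"];
  // Only bypass codex's inner sandbox when an outer isolation boundary
  // exists (Docker container or sprite). Host and undefined keep it on.
  if (target === "sandbox" || target === "sprites") {
    argv.push("--dangerously-bypass-approvals-and-sandbox");
  }
  argv.push(prompt);
  return argv;
}
```

Defaulting to the restrictive branch means a caller that forgets to thread `target` through gets codex's own sandbox rather than no sandbox at all.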

balegas and others added 30 commits April 30, 2026 15:02

docs(plans): add Slice B implementation plan for coding-agents migration

b24a438

feat(coding-agents): add nativeJsonl collection schema and nativeSessionId to sessionMeta

31ace6f

test(coding-agents): unit test — onNativeLine already wired in StdioBridge

05d2835

feat(coding-agents): wire --resume <nativeSessionId> in StdioBridge

835b90c

feat(coding-agents): tee onNativeLine into nativeJsonl and capture nativeSessionId per turn

738e043

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(coding-agents): register nativeJsonl collection in coding-agent entity definition

e9d45e0

test(coding-agents): integration test for lossless resume (Slice B)

794771a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(agents): add spawn_coding_agent and prompt_coding_agent tools

a8e68ac

feat(agents): migrate Horton from spawn_coder/prompt_coder to spawn_coding_agent/prompt_coding_agent

c061e06

feat(agents): remove legacy coder entity (coding-session.ts, spawn-coder.ts) and unregister from bootstrap

64970c0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(agents-runtime): remove legacy CodingSession types and useCodingAgent implementation

b912bc7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(agents-server-ui): extend status colors for coding-agent states and add new tool cases

169fc37

feat(agents-server-ui): add CodingAgentView, useCodingAgent, CodingAgentTimeline, CodingAgentSpawnDialog; delete legacy CodingSession components

0de9ff6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(agents-server-ui): wire Pin/Release/Stop via REST /send and route spawn-dialog initialPrompt through initialMessage

14062bc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(coding-agents): record Slice B final-review UI fixes in run report

c41dfa3

docs(coding-agents): add user-facing docs page and implementation review

3664cc7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(electric-ax): add dev script for host-mode services (only postgres + electric in docker)

d78a692

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(deps): patch agent-session-protocol@0.0.2 to read session_id (snake_case)

0fa2995

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(agents-server-ui): render thinking/turn_aborted/permission/error event types in CodingAgentTimeline

74732f8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(specs): amend resume mechanism description and entity URL convention

af5aada

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: document required 'from' field on /send HTTP endpoint

60012c9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

balegas and others added 30 commits May 3, 2026 20:57

balegas wants to merge 279 commits into main from coding-agents-slice-a
