
feat(coding-agents): platform primitive (MVP through Fly Sprites + bug hunt) #4256

Open
balegas wants to merge 279 commits into main from coding-agents-slice-a


@balegas balegas commented May 1, 2026

Summary

Introduces @electric-ax/coding-agents — a platform primitive for spawning, supervising, and resuming coding-agent sandboxes. Migrates Horton off the legacy coding-session entity onto the new typed runtime API and ships a dedicated UI view. Currently supports claude / codex / opencode as agent kinds and sandbox / host / sprites as sandbox targets.

MVP and Slice A–C₂ (foundation)

  • MVP: LocalDockerProvider + StdioBridge running claude --print --output-format=stream-json inside a containerised workspace.
  • Slice A: Runtime API (ctx.spawnCodingAgent / ctx.observeCodingAgent), entity handler with reconcile, LifecycleManager (idle timer, pin refcount), WorkspaceRegistry (per-identity mutex, slug resolution).
  • Slice B: Lossless resume via post-turn transcript materialisation, Horton tool migration, legacy coding-session removal, dedicated UI components.
  • Slice C₁: SandboxInstance.copyTo (avoid argv ARG_MAX), probe-and-materialise resume, --env-file for secrets, idle-eviction wake.
  • Slice C₂: Codex parity via the agent adapter registry (agent-session-protocol's CodingAgentKind widened to claude+codex; UI kind picker; per-kind probe/capture/materialise hooks).

Conformance + cross-kind

  • Conformance suite (packages/coding-agents/src/conformance/integration.ts): 16 scenarios × N kinds × M targets, pluggable across providers. Catches drift between LocalDocker / Host / Sprites.
  • Cross-kind resume + fork: convert-kind swaps the CLI in place; fork (with from) spawns a sibling agent inheriting the parent's denormalised event history. Same-kind and cross-kind both supported.

Slice: opencode (third agent kind)

  • opencode-ai joins claude + codex as a first-class spawnable kind.
  • Curated model picker (defaults to openai/gpt-5.4-mini-fast); per-provider env-var fallback.
  • Cross-kind UI gated for opencode in v1 (discoverable absence; tooltip points at the deferred follow-up).

Slice: Fly Sprites provider

sprites.dev (Fly's purpose-built agentic-sandbox product) ships as a third sandbox target alongside sandbox (LocalDocker) and host.

  • All three coding-agent kinds work on sprites.
  • convert-kind works in place; fork within sprites carries conversation history.
  • Cross-provider transitions (sandbox/host ↔ sprites) are intentionally not supported — UI surfaces the option but disables it with an explanatory tooltip.
  • Round-1 implementation (early commits) was based on a doc-only recon of API rc30. Round-2 fixes (against the live rc43 server, post-merge bug-hunt) corrected:
    • Exec endpoint URL (api.sprites.dev/v1/sprites/{name}/exec, not the per-sprite URL)
    • Output frame multiplexing (1-byte stream-id prefix: 0x01 stdout, 0x02 stderr, 0x03 <code> exit)
    • Stdin via HTTP POST (the WS stdin protocol shifted between rc30 and rc43)
    • Bootstrap script (default Ubuntu image preinstalls claude/codex/gemini/node — only opencode-ai needs install, with --prefix=/usr/local)
    • Env file source + export (set -a; . file; set +a)
    • OAuth-token mirror (ANTHROPIC_API_KEY shaped as sk-ant-oat... mirrored to CLAUDE_CODE_OAUTH_TOKEN)
    • bootstrap.ts actually wires the sprites provider when SPRITES_TOKEN is set
  • Full bug-by-bug record: docs/superpowers/plans/2026-05-02-coding-agents-fly-sprites.md § Implementation findings — round 2.
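The stream-id framing listed in the round-2 fixes can be sketched as a small demultiplexer. This is a minimal illustration, not the provider's actual code: it assumes each frame arrives as one binary message, that the exit code is the single byte after the 0x03 stream id, and that a missing code byte falls back to 0. The names `Frame` and `demuxFrame` are hypothetical.

```typescript
// Hypothetical frame shape: one binary message per frame, first byte = stream id.
type Frame =
  | { stream: "stdout" | "stderr"; data: Uint8Array }
  | { stream: "exit"; code: number };

function demuxFrame(frame: Uint8Array): Frame {
  if (frame.length === 0) throw new Error("empty frame");
  const id = frame[0];
  const payload = frame.subarray(1);
  switch (id) {
    case 0x01:
      return { stream: "stdout", data: payload };
    case 0x02:
      return { stream: "stderr", data: payload };
    case 0x03:
      // Assumption: exit code is one byte; a missing byte is treated as 0 here.
      return { stream: "exit", code: payload.length > 0 ? payload[0] : 0 };
    default:
      throw new Error(`unknown stream id 0x${id.toString(16)}`);
  }
}
```

Keeping the demux as a pure function over a single frame makes it easy to unit-test without a live WebSocket.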

UI bug-hunt (Playwright MCP, 10 iterations)

Drove the UI end-to-end through Playwright MCP against the live dev stack. Surfaced and shipped four fixes:

  • F-1: cleanup:sprites script's PREFIXES missed the production coding-agent- prefix; production-spawned leaks were invisible.
  • F-2: Pin / Release / Stop / Convert-target / Convert-kind buttons stayed enabled on destroyed entities; gated in EntityHeader.tsx + Playwright spec.
  • F-3: New pnpm cleanup:volumes operator script (mirrors cleanup:sprites). Lists/deletes orphaned coding-agent-workspace-* volumes; default skips still-mounted ones.
  • F-4: dev.mjs lost stream registry across up-after-down because the embedded DurableStreamTestServer kept state in memory. Now sets ELECTRIC_AGENTS_STREAMS_DATA_DIR=.local/dev-streams; clear-state wipes it alongside compose volumes.

Bug-hunt report: docs/superpowers/specs/2026-05-03-bug-hunt-report.md.

Documentation

  • packages/coding-agents/README.md restructured around current state: TOC, quick-reference table, lifecycle/inbox-message/status reference, targets capability matrix, cleanup utilities, consolidated TLs.
  • docs/superpowers/specs/2026-05-02-coding-agents-fly-sprites-design.md carries a header banner pointing at round-2 findings and inlines the corrections in §1.
  • docs/superpowers/plans/2026-05-02-coding-agents-fly-sprites.md includes both round-1 and round-2 implementation findings.

Test plan

  • pnpm -C packages/coding-agents test (unit) — 139 green; new sprites tests cover stream-id demux, exit-frame fallback, name format [a-z0-9-]+, bootstrap content, wiring smoke.
  • DOCKER=1 LocalDocker conformance — full 16 scenarios across claude + codex + opencode.
  • HOST_PROVIDER=1 host-provider conformance.
  • SPRITES=1 SPRITES_TOKEN=… sprites-wiring.e2e.test.ts green (2.5 s) — guards bootstrap-wiring + sprite name format regressions.
  • pnpm -C packages/agents-server-ui exec playwright test: spawn-via-dialog.spec.ts (6 cases) + spawn-sprites.spec.ts (2 cases).
  • Manual UI walkthrough end-to-end: 10 bug-hunt iterations covering claude/codex/opencode × sandbox/host × volume/bindMount, convert-kind transcript carry, same-kind + cross-kind fork, pin/release/stop/kill lifecycle, sprites first-turn including bootstrap, horton.
  • Deferred: full fly-sprites-conformance.test.ts re-run under round-2 fixes (vitest verbose-reporter buffering issue; needs streaming-reporter or per-scenario splits).

🤖 Generated with Claude Code

balegas and others added 30 commits April 30, 2026 15:02
Slice B finishes the platform-primitive migration: resume via
nativeJsonl tee + cold-boot materialization, Horton tool migration to
spawn_coding_agent / prompt_coding_agent, full removal of the legacy
coder entity (source, tools, runtime types, UI, bootstrap), and full
UI revamp (CodingAgent* components, status enum extension, header
Pin/Release/Stop buttons, lifecycle row rendering). Plus a runtime-
level e2e test that closes the gap which hid Slice A's slug and
flat-schema bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eager rebuild was scoped here to support state().workspace.sharedRefs
accuracy after server restart, but the UI indicator consuming that field
(sandbox provenance / 'shared with N' header) is also Slice C. Defer
eager rebuild to land alongside its consumer; keep Slice A's lazy
per-agent rebuild on first handler entry.
…tiveSessionId per turn

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds sanitiseCwd/materialiseResume helpers and calls materialiseResume
in processPrompt (after sandbox.started, before wr.acquire) so that
`claude --resume <sessionId>` finds its JSONL session file inside the
sandbox on every cold-boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slice B's plan teed claude's stream-json stdout into nativeJsonl,
then materialized those lines as the resume file. That's wrong —
claude --resume reads its on-disk transcript, which has a different
shape than the stream-json stdout. The bridge captures the wrong
data; --resume rejects the malformed file.

Reworked:
- nativeJsonl is now a single-row transcript blob (key='current',
  nativeSessionId, content) holding the actual on-disk transcript.
- Handler reads the transcript via base64-piped docker exec after
  each successful turn (captureTranscript) and writes the full blob
  to nativeJsonl, overwriting the previous capture.
- Cold-boot materialize reads the single row's content and pipes it
  back via base64.
- Bridge no longer relies on agent-session-protocol@0.0.2 to extract
  session_id (it reads entry.sessionId in camelCase but claude emits
  session_id in snake_case, so the protocol returns ''). Bridge now
  parses the raw stdout JSON directly to extract session_id.
- onNativeLine is no longer used by the handler. The bridge still
  invokes it (Task 1.1's test still passes), but the handler doesn't
  subscribe.
- Updated handler-resume.test.ts seeds and assertion to match the
  new single-row schema.

Verified end-to-end with DOCKER=1 slice-b integration test (BANANA
roundtrip): turn 1 establishes "favorite fruit is BANANA"; turn 2
on a fresh sandbox correctly recalls "BANANA" from the materialized
transcript.
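
The direct session_id extraction described above might look like the following sketch. It assumes only that a stream-json stdout line is a JSON object that may carry a top-level snake_case `session_id` string; the helper name `extractSessionId` is hypothetical.

```typescript
// Pull session_id out of one raw stream-json stdout line.
// Returns undefined for non-JSON lines or events without a session id.
function extractSessionId(rawLine: string): string | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawLine);
  } catch {
    return undefined; // not JSON, e.g. a stray log line
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  // Read the snake_case key that claude emits, not camelCase sessionId.
  const id = (parsed as Record<string, unknown>)["session_id"];
  return typeof id === "string" && id.length > 0 ? id : undefined;
}
```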

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…der.ts) and unregister from bootstrap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Agent implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entTimeline, CodingAgentSpawnDialog; delete legacy CodingSession components

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… and Pin/Release/Stop buttons into router and sidebar

- Replace CodingSessionView/CODING_SESSION_ENTITY_TYPE with CodingAgentView/coding-agent in router.tsx
- Hoist useCodingAgent at router level; pass result as prop to CodingAgentView (single SSE connection)
- Pass db to EntityHeader for Pin/Release/Stop inbox-dispatch buttons
- Replace CodingSessionSpawnDialog with CodingAgentSpawnDialog in Sidebar; remove all legacy CODING_SESSION_ENTITY_TYPE refs
- Add @electric-ax/coding-agents workspace dep; fix Streamdown content->children prop
- UI typecheck clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the migration completion: resume + Horton + legacy removal +
UI revamp. Captures the major mid-flight pivot (resume mechanism
changed from per-line stream-json tee to single-row on-disk
transcript capture) and the upstream agent-session-protocol bug
(reads sessionId camelCase but claude emits session_id snake_case).
All 32 coding-agents tests + 388 runtime + 44 agents pass green
with DOCKER=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… spawn-dialog initialPrompt through initialMessage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge build path

Three real bugs surfaced when the agents-server-ui dist was rebuilt
post-Slice-B and the manual smoke test exercised an entity in the
browser:

1. router.tsx called useCodingAgent AFTER an early-return for
   missing selectedEntity → React error #310 ("rendered more hooks
   than during previous render"). Moved the hook call BEFORE the
   early return.

2. CodingAgentTimeline.tsx read tool_call payloads as
   `{toolName, args}` — but agent-session-protocol's ToolCallEvent
   uses `{tool, input}`. Tool calls rendered as generic "tool" with
   no name. Fixed to match the protocol's actual shape.

3. useCodingAgent.ts imported wire constants from
   @electric-ax/coding-agents — that package transitively pulls in
   node-only deps (LocalDockerProvider, StdioBridge, agent-session-
   protocol's randomUUID import) that vite can't externalize for
   the browser bundle. Hardcoded the four wire strings locally to
   break the import chain.

Plus two infra changes that enable iterating on the
agents-server image locally:

- packages/agents-server/Dockerfile: copy
  `packages/coding-agents/package.json` and add a build step for
  the `coding-agents` workspace package (it didn't exist when this
  Dockerfile was written; agents now depends on it).

- packages/electric-ax/docker-compose.full.yml: change agents-server
  pull_policy from `always` to `missing` so a locally-tagged
  `electricax/agents-server:local` image is honoured by the
  quickstart instead of failing with a registry pull error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es + electric in docker)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quests

The /send endpoint requires a 'from' identifier (used to tag inbox
provenance). MessageInput sends 'user'; the new Pin/Release/Stop
buttons forgot it and got HTTP 400 'Missing required field: from'.
Added from='user' to all three. Verified with a curl POST that the
endpoint now returns 204 No Content.
… on cold boot

Previously, processPrompt unconditionally inserted sandbox.starting +
sandbox.started lifecycle rows on every prompt. Since lm.ensureRunning
is idempotent (returns the existing sandbox when one is running), warm
prompts produced misleading 'Sandbox starting' entries between every
turn in the UI timeline — the user's inference that the sandbox was
restarting was understandable but wrong.

Now both rows fire only when the prior status was 'cold'. Materialise
of nativeJsonl is also gated on cold-boot (no need to re-write the
file when the existing container already has it). Capture-transcript
remains unconditional (we want a fresh snapshot after each turn).

A small backfill path keeps sessionMeta.instanceId fresh on warm
prompts when it was unset for some reason.
…ke_case)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… event types in CodingAgentTimeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
balegas and others added 30 commits May 3, 2026 20:57
… catch

Phase 1 Task 1 from docs/superpowers/plans/2026-05-03-coding-agents-
post-review-followups.md was driven by R2 #5: "chain leak under
concurrent acquirers — first releaser's chain-pointer check never
matches, entries grow unbounded".

Empirical investigation: the bug doesn't manifest. Two new tests
(N=8 concurrent acquire→release tasks; non-overlap + completion
proof) confirm chainByIdentity drains to 0 after all releases and
no acquirer is dropped. The reason: the chain-delete branch is
gated by `remaining === 0` (last releaser), and at that point the
chain pointer is necessarily the last `link` set (no new acquirer
present). The two conditions are coupled, not racy.

Keeping the tests as a forward-looking regression catch — any future
refactor that introduces the failure mode the reviewer described
will fail loudly.

Plan task is closed as "not a bug, regression test added".
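
A minimal sketch of the coupling described above, assuming a promise-chain mutex keyed by identity. The class and field names are illustrative, not the WorkspaceRegistry's actual code; the point is that the chain entry is deleted only by the last releaser, when no later acquirer can be queued behind it.

```typescript
class IdentityMutex {
  private chainByIdentity = new Map<string, Promise<void>>();
  private pendingByIdentity = new Map<string, number>();

  // Resolves with a release function once every earlier holder has released.
  async acquire(identity: string): Promise<() => void> {
    const prev = this.chainByIdentity.get(identity) ?? Promise.resolve();
    let unlock!: () => void;
    const link = new Promise<void>((resolve) => {
      unlock = resolve;
    });
    // The chain now ends at this acquirer's link.
    this.chainByIdentity.set(identity, prev.then(() => link));
    this.pendingByIdentity.set(identity, (this.pendingByIdentity.get(identity) ?? 0) + 1);
    await prev;

    let released = false;
    return () => {
      if (released) return;
      released = true;
      const remaining = (this.pendingByIdentity.get(identity) ?? 1) - 1;
      if (remaining === 0) {
        // Last releaser: nobody is queued, so the entries can be dropped.
        this.pendingByIdentity.delete(identity);
        this.chainByIdentity.delete(identity);
      } else {
        this.pendingByIdentity.set(identity, remaining);
      }
      unlock();
    };
  }

  size(): number {
    return this.chainByIdentity.size;
  }
}
```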

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 2; drives Phase 0 Task 0.2 (L2.10 conformance) green.

Before this change: an agent that ended a prior turn in status='error'
went straight to status='running' on the next prompt. wasCold was
false (only checks 'cold'), so no sandbox.starting lifecycle row was
emitted; the stale lastError stayed visible through completion.
State-machine paper claims error→cold→starting→running; reality was
error→running.

After: at top of processPrompt (after cancelIdleTimer), if status is
'error', flip to 'cold' and clear lastError so the cold-boot path
runs normally. Idempotent on every status that isn't 'error'.
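
As a rough sketch of that normalisation step (the status union and the `SessionMeta` shape are assumptions pieced together from the statuses named in this PR, not the actual schema):

```typescript
// Assumed status set; the real enum lives in the entity schema.
type AgentStatus = "cold" | "starting" | "idle" | "running" | "stopping" | "error" | "destroyed";

interface SessionMeta {
  status: AgentStatus;
  lastError?: string;
}

// Run at the top of prompt handling: an agent that ended its previous turn in
// 'error' re-enters the cold-boot path with the stale error cleared.
function normaliseBeforePrompt(meta: SessionMeta): SessionMeta {
  if (meta.status !== "error") return meta; // no-op for every other status
  return { ...meta, status: "cold", lastError: undefined };
}

// Lifecycle rows fire only when the prior status was 'cold'.
function lifecycleRowsFor(meta: SessionMeta): string[] {
  return meta.status === "cold" ? ["sandbox.starting", "sandbox.started"] : [];
}
```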

Verified: L2.10 conformance passes 3/3 kinds on LocalDocker (35s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 3; drives Phase 0 Task 0.5 (L2.13 conformance) green.

Before this change: fork copied source events + nativeJsonl
unconditionally. If the source was running/starting/stopping,
events were still streaming when the fork's first-wake observed
them; convertNativeJsonl produced a transcript ending mid-message
and the fork's first --resume would corrupt state.

After: at the top of the fork branch (right after the events
collection check), if sourceMeta.status is running/starting/stopping
we emit kind.convert_failed with detail "fork rejected: source not
quiescent (status=...)", set lastError + status='error' on the
fork, and return. The user can re-fork once the source quiesces.

Verified: L2.13 conformance passes 3/3 kinds on LocalDocker; L2.7,
L2.8, L2.9, L2.10, L2.11, L2.12 also still pass (21/21 in scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 4 from R2 #8.

L2.4 used to only assert runs[last].status === 'completed'. A provider
that returned 'running' for a stale agent would reach reconcile's
`isOrphaned` branch and transition status to `idle` — a different
end-state than `cold` (the path for stopped/unknown). The original
test passed for either outcome.

Adding `expect(finalMeta.status).toMatch(/^(cold|idle)$/)` locks both
branches into the contract — anything else (e.g. status stays
`running` because reconcile mis-classified) fails loudly.

Verified: still 3/3 on LocalDocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mpotent close

Phase 2 Task 5 from R1 #4.

execWithStdinViaPost's writeStdin and closeStdin had two defects:
- writeStdin appended to stdinBuf with no guard. After closeStdin
  fired the actual POST, any further write was silently buffered
  and lost (the data was never sent — the request had already
  flushed).
- closeStdin re-fired `void start()` on every call. start() itself
  was idempotent via its `started` flag, but the contract is
  cleaner if closeStdin is itself a no-op the second time.

Fix: a `closed` flag. writeStdin throws if called after close;
closeStdin is a no-op the second time.

Defence-in-depth: StdioBridge calls writeStdin then closeStdin
sequentially per turn, so the new throw can't fire under normal
usage. The guard catches future caller misuse loudly instead of
silently swallowing data.
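
The guard can be sketched as below. `BufferedStdin` and its `send` callback are hypothetical stand-ins for the buffered POST path; only the `closed` flag semantics come from the description above.

```typescript
class BufferedStdin {
  private chunks: string[] = [];
  private closed = false;
  private readonly send: (body: string) => void;

  constructor(send: (body: string) => void) {
    this.send = send;
  }

  writeStdin(chunk: string): void {
    // Writes after close would never reach the request, so fail loudly.
    if (this.closed) throw new Error("writeStdin called after closeStdin");
    this.chunks.push(chunk);
  }

  closeStdin(): void {
    if (this.closed) return; // second close is a no-op
    this.closed = true;
    this.send(this.chunks.join(""));
  }
}
```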

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 Task 6 from R1 #5.

Sprites' exec API treats per-call env in a way that's both unstable
(rc30→rc43 protocol shift) and inconsistent (silently ignored when
the cmd is shell-wrapped). Today's WS / POST-stdin paths both
forwarded `req.env` via repeated `?env=KEY=VALUE&` query params; for
the conformance harness's claude-on-sprites tests the per-call
ANTHROPIC_API_KEY mirror reached the child only because the env
file already had it.

Fix: stage per-call env inside the wrapper script. wrapWithAgentEnv
now takes an optional `env` arg and emits an `export KEY=value` line
per pair, after sourcing /run/agent.env (so per-call values override
the file's defaults). Env-key validation rejects shell-unsafe names
to defend against caller-controlled injection. The WS and POST exec
URLs no longer carry env query params at all.
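
A sketch of how such a wrapper might stage per-call env. The key regex and the single-quote escaping of values are assumptions for illustration; the source only states that an `export KEY=value` line is emitted per pair after sourcing /run/agent.env and that shell-unsafe key names are rejected.

```typescript
// Assumed definition of a shell-safe variable name.
const ENV_KEY = /^[A-Za-z_][A-Za-z0-9_]*$/;

// POSIX single-quote escaping: close the quote, emit \', reopen.
function shellQuote(s: string): string {
  return "'" + s.replace(/'/g, "'\\''") + "'";
}

function wrapWithAgentEnv(cmd: string, env: Record<string, string> = {}): string {
  // Source the env file with auto-export on, so its values reach the child.
  const lines = ["set -a; . /run/agent.env; set +a"];
  for (const [key, value] of Object.entries(env)) {
    if (!ENV_KEY.test(key)) throw new Error(`unsafe env key: ${JSON.stringify(key)}`);
    // Emitted after the file is sourced, so per-call values win.
    lines.push(`export ${key}=${shellQuote(value)}`);
  }
  lines.push(cmd);
  return lines.join("\n");
}
```

Because the exports come after the file is sourced, a per-call `ANTHROPIC_API_KEY` overrides whatever the env file holds.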

Verified: L1.5 (exec honours cwd and env) passes on the live sprites
API; sprites unit tests 18/18 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 Task 7 from R1 #7.

The Sprites API paginates with `has_more` + `next_continuation_token`,
but the SpritesApiClient only fetched a single page. A sprite buried
past the first page would be silently missed by `findExisting`,
causing the next `createSprite` for the same agentId to 409 with
"name already exists". Same risk in the cleanup-sprites operator
script and the new conformance afterAll.

Adds `listAllSprites(opts)` that loops until `!has_more` (or runs
out of `next_continuation_token`), with a 50-page hard cap as a
defensive guard. Updates `findExisting`, `cleanup-sprites.ts`, and
the conformance afterAll to use it.
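
The pagination loop could look roughly like this. The `items` field name and the `fetchPage` callback are hypothetical; `has_more`, `next_continuation_token`, and the 50-page cap are taken from the description above. Returning the partial result when the cap is hit is an assumption.

```typescript
interface Page<T> {
  items: T[]; // assumption: the real response field name may differ
  has_more: boolean;
  next_continuation_token?: string;
}

const MAX_PAGES = 50; // defensive hard cap

async function listAllPages<T>(
  fetchPage: (token: string | undefined) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let token: string | undefined = undefined;
  for (let page = 0; page < MAX_PAGES; page++) {
    const res = await fetchPage(token);
    all.push(...res.items);
    // Stop when the server says so, or when it fails to hand back a token.
    if (!res.has_more || !res.next_continuation_token) break;
    token = res.next_continuation_token;
  }
  return all;
}
```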

Sprites unit tests 18/18 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 Task 8; drives Phase 0 Task 0.7 (L1.11 conformance) green on Host.

Before: HostProvider.stop() was a no-op. The SandboxProvider contract
(which L1.11 codifies) says stop() must terminate the running child
within N seconds; calling stop() while a turn was mid-exec on the
host left the child running unsupervised. LocalDocker met this by
implication (container removal kills the process); sprites by the WS
close path; host did not.

After: AgentRecord tracks `activeChildren: Set<ChildProcess>`. Each
spawn registers; child `exit`/`error` unregisters. stop() and
destroy() now SIGTERM every active child for the matching agent and
fall back to SIGKILL after a 5 s grace period via a shared
terminateChildren helper.
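
A minimal sketch of the SIGTERM-then-SIGKILL pattern, written against a small `Killable` interface (a hypothetical stand-in for the child-process handle) so it can be exercised without spawning anything. It assumes, as described above, that children remove themselves from the set when they exit.

```typescript
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
}

function terminateChildren(children: Set<Killable>, graceMs = 5000): void {
  const snapshot = Array.from(children);
  for (const child of snapshot) child.kill("SIGTERM");
  setTimeout(() => {
    for (const child of snapshot) {
      // Children that exited during the grace period have unregistered.
      if (children.has(child)) child.kill("SIGKILL");
    }
  }, graceMs);
}
```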

Verified: L1.10 + L1.11 both pass on Host (LocalDocker and sprites
conformance unaffected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The file's own comment said 'delete in a follow-up once the conformance
suite has shipped for one release cycle' — that's now true. Slice-A
lifecycle scenarios are exercised by local-docker-conformance.test.ts
via runCodingAgentsIntegrationConformance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…euristic

Phase 3 Task 11.

- sessionId now must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/ (rejects
  leading dashes like '-rf'). Defence-in-depth — adapters shellQuote
  but the CLI boundary should reject obvious shell metacharacters
  even before the entity boundary.
- isMain replaces `endsWith('import.js')` with
  `path.basename(...) === 'import.js'` so a consumer file with that
  suffix doesn't accidentally activate this CLI body when imported.

Test added for the leading-dash rejection. cli-import unit tests 10/10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 Task 10.

- model now matches /^[A-Za-z0-9._/:-]+$/. Without this, a model like
  `gpt-4";evil="x` would close the value quote in the
  `-c model="..."` flag and inject an arbitrary config key. Adapters
  are reachable from import CLI / spawn args paths and shouldn't trust
  caller-supplied input.
- sessionId now matches /^[A-Za-z0-9-]+$/ (UUID shape) in
  buildCliInvocation, probeCommand, and captureCommand. Without this, a
  sessionId containing `*` or `?` would broaden the find glob silently
  and could match an unrelated transcript.

Both validations throw early with a clear error rather than silently
producing wrong argv.
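
The two checks can be sketched as follows. The regexes are the ones quoted above; the function names and error messages are illustrative.

```typescript
const MODEL_RE = /^[A-Za-z0-9._/:-]+$/;
const SESSION_ID_RE = /^[A-Za-z0-9-]+$/;

// Throw early rather than build an argv that embeds attacker-shaped input.
function assertModel(model: string): string {
  if (!MODEL_RE.test(model)) throw new Error(`invalid model: ${JSON.stringify(model)}`);
  return model;
}

function assertSessionId(sessionId: string): string {
  if (!SESSION_ID_RE.test(sessionId)) {
    throw new Error(`invalid sessionId: ${JSON.stringify(sessionId)}`);
  }
  return sessionId;
}
```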

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 Task 9.

The handler races bridge.runTurn against runTimeoutMs via raceTimeout.
Before this change, the loser's child process kept running — raceTimeout
rejects the promise, but the orphan CLI stays attached to its sandbox
until the sandbox itself tears down.

Adds optional `signal?: AbortSignal` to RunTurnArgs. The bridge
forwards an abort to handle.kill('SIGTERM') and rejects with a clear
'aborted (signal)' error if the abort fires before normal exit. The
listener is removed on completion to avoid leaking the closure.

The handler creates an AbortController with a setTimeout matching
runTimeoutMs and passes its signal alongside the existing raceTimeout.
The two coexist: raceTimeout owns the promise-level rejection; the
signal owns the child-reaping. clearTimeout in finally avoids the
timer firing post-completion.

All unit tests pass (151/159; the 2 failing are pre-existing
handler-resume flakes unrelated to this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 Task 19: replaces a fixed setTimeout(2500) with a poll loop in
slice-b.test.ts. R5 flagged the timing as flaky on slow CI — the idle
timer fires at 1500 ms but the visible 'stopped'/'unknown' status
transition includes container teardown which can push past the fixed
wait. New helper polls every 100 ms with a 30 s ceiling.

Bonus: typecheck-fix in conformance/integration.ts where the L2.13
scenario referenced a non-existent 'fork.failed' lifecycle event.
The schema's enum uses 'kind.convert_failed' for both kind-converts
and forks (see entity/collections.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 Tasks 12 + 13.

T12 — CodingAgentSpawnDialog.canSubmit now enforces the
target ⇄ workspaceMode invariants explicitly:
  - target='host' requires workspaceMode='bindMount'
  - target='sprites' requires workspaceMode='volume'
The button-click handlers force the right workspace when toggling
target, but a strict-mode double-render or future refactor could
submit a stale combo. Belt-and-braces.

T13 — EntityHeader Convert-target and Convert-kind dropdowns now
disable when status is 'error', not just running/starting/stopping
or destroyed. Converting from a failed prior turn risks acting on
stale lastError state — the user should retry to clear the error
first. 'cold' stays allowed (it's a state-mutation only; no
sandbox is up).
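
The T12 invariant can be expressed as a small predicate. This is a sketch: the type names are hypothetical, and treating target='sandbox' as accepting both workspace modes is an assumption based on the test plan's sandbox × volume/bindMount coverage.

```typescript
type Target = "sandbox" | "host" | "sprites";
type WorkspaceMode = "volume" | "bindMount";

function isValidCombo(target: Target, workspaceMode: WorkspaceMode): boolean {
  if (target === "host") return workspaceMode === "bindMount";
  if (target === "sprites") return workspaceMode === "volume";
  return true; // assumption: sandbox accepts either mode
}
```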

Both files typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: for any input string s and any partition s = p1+p2+...+pN,
feeding the partitions sequentially into StreamQueue and calling end()
produces a line sequence identical to feeding s whole.

Catches future regressions of the C2 line-tail-buffer fix in
providers/fly-sprites/exec-adapter.ts. 7 canonical inputs × 200
deterministic seeds = 1400 partitions checked. Sub-second runtime.

StreamQueue is now exported from exec-adapter.ts so the test can
import it directly without going through the full ExecHandle factory.
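
The partition property can be illustrated with a string-based analogue. `LineBuffer` below is a hypothetical stand-in for StreamQueue (whose real interface is not shown here), and the seeded partitioner uses a simple linear congruential generator purely to keep the splits deterministic.

```typescript
class LineBuffer {
  private tail = "";
  private lines: string[] = [];

  push(chunk: string): void {
    const parts = (this.tail + chunk).split("\n");
    // The last piece has no terminating newline yet; hold it back.
    this.tail = parts.pop() ?? "";
    this.lines.push(...parts);
  }

  end(): string[] {
    if (this.tail.length > 0) {
      this.lines.push(this.tail);
      this.tail = "";
    }
    return this.lines;
  }
}

function feed(chunks: string[]): string[] {
  const buf = new LineBuffer();
  for (const c of chunks) buf.push(c);
  return buf.end();
}

// Split s into chunks of 1 to 4 characters, deterministically per seed.
function partition(s: string, seed: number): string[] {
  let state = seed >>> 0;
  const out: string[] = [];
  let i = 0;
  while (i < s.length) {
    state = (state * 1664525 + 1013904223) >>> 0;
    const len = 1 + (state % 4);
    out.push(s.slice(i, i + len));
    i += len;
  }
  return out;
}
```

The property then reads: for every input `s` and seed, `feed(partition(s, seed))` equals `feed([s])`.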

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: at the 900_000-byte cap, accept LIMIT-1 and LIMIT, reject
LIMIT+1, and reject any multibyte string whose UTF-8 byte length
overflows even when its UTF-16 code-unit length doesn't.

Catches future regressions of the C5 fix in StdioBridge that replaced
prompt.length with Buffer.byteLength(prompt, 'utf8'). One test
specifically pins the multibyte boundary case where the old code would
have silently accepted a too-long prompt; another asserts the error
message reports byte count (not chars) so users can correlate with the
limit.
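
A sketch of the byte-length check. It uses `TextEncoder` rather than `Buffer.byteLength` so the example does not depend on Node typings; the function names and error wording are illustrative, and the 900_000-byte limit is the figure quoted above.

```typescript
const PROMPT_LIMIT_BYTES = 900_000;

function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

function assertPromptFits(prompt: string): void {
  // prompt.length counts UTF-16 code units and undercounts multibyte text.
  const bytes = utf8ByteLength(prompt);
  if (bytes > PROMPT_LIMIT_BYTES) {
    throw new Error(`prompt is ${bytes} bytes; limit is ${PROMPT_LIMIT_BYTES} bytes`);
  }
}
```

For example, 450_001 copies of "é" have a `length` well under the cap but encode to 900_002 bytes, so the check rejects them.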

5 cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: every adapter's buildCliInvocation produces byte-stable
argv across (kind × {prompt-only, with-model, with-session,
with-both}) input shapes.

Would have caught the opencode --print-logs accident at compile-time
of the test suite, not at L2.1 runtime. Catches future drift in
claude/codex/opencode argv shape — reviewer must explicitly approve
any change via 'pnpm test -u'.

10 snapshots locked. opencode prompt-only / session-only shapes are
skipped (the adapter requires a model).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: every adapter's probeCommand / captureCommand /
postMaterialiseCommand must treat sessionId as data, not code. For
each (adapter × adversarial input) pair, the resulting argv must
either (a) not pass through sh -c (direct argv invocation is safe by
construction), or (b) embed the adversarial input fully inside one
sh-tokenized word's content, or (c) reject the input by throwing
during the build (validation-throw is equivalent to safe).

Adversarial corpus: 9 strings covering single-quote close-and-reopen,
command substitution sigils, redirect operators, glob metacharacters,
spaces, backslash, and the textbook '\'' escape attempt.

Generalises the C6 fix (opencode shellQuote) — any future adapter
that forgets to shell-quote a caller-controlled field will fail this
test loudly. Tokenizer handles single-quote, double-quote (with the
correct sh-style backslash escape rules: only $, \`, \", \\,
\n are escapes inside double quotes), and outside-of-quotes
backslash-escape.

7 cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property: for every (initial status × control-plane message) pair,
the resulting (final status, lastError-presence, lifecycle-events)
is byte-stable against a checked-in snapshot.

Catches state-machine drift over time. Any future handler change
that alters a cell forces an explicit `pnpm test -u` and reviewer
approval — the diff makes the new transition visible.

Coverage: 7 statuses × 6 control-plane messages = 42 cells. The
`prompt` message is excluded; it requires non-trivial bridge mocking
and is covered end-to-end by the L2 conformance suite.

Stub provider returns status='unknown' so reconcile orphan branches
fire when initial status is 'running'. Stub bridge throws if called.
Each cell isolates with a fresh handler+lm+wr triple.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 Task 14.

PUT /coding-agent/<name> returns once the entity row is durable, but
the handler registration that consumes our 'pin' payload lands on a
separate path. Several specs raced spawn → pin → goto → assert and
saw an empty timeline because the pin landed before init wiring was
complete. The race wasn't reliably reproducible in this session but
matches the R4 review finding.

waitForEntityReady polls GET /coding-agent/<name> for up to 5 s
before sending the pin. When the entity is registered and
discoverable, fire the pin as before.

Verified: full Playwright suite still passes 31/31 (1 skipped) on the
running dev server. No spec changes needed — helper is internal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The race the reviewer described wasn't reproducible in this session;
the Playwright suite was 31/31 green both with and without the poll.
The 100 ms × N poll overhead doesn't earn its keep.

If a flake actually surfaces, re-add a targeted poll. Speculative
coverage isn't worth the runtime cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
L2.9 codified a chain-leak property in WorkspaceRegistry that I
empirically verified does NOT manifest (R2 #5 was a false positive,
documented in the regression-catch unit tests). The 3-way concurrent
test passed today by accident of correctness, not because it caught
a bug.

L2.6 (2-way non-overlap with the lease serialising concurrent runs)
covers the load-bearing property. The unit-level workspace-registry
tests added in commit 2021c5d cover the chain-pointer cleanup
invariant directly without spinning up real sandboxes.

Net: −68 lines, no coverage regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-timer redundancy (raceTimeout + AbortController firing in
parallel) was a code smell. The orphan-child cleanup is the sandbox
layer's responsibility:
  - LocalDocker: container removal kills the child on next destroy().
  - HostProvider: T8 added activeChildren tracking; stop()/destroy()
    SIGTERM with SIGKILL fallback.
  - FlySpriteProvider: WS close terminates the exec.

Without T9, the bridge has no concept of timeouts — that's correct
layering. raceTimeout in the handler rejects the runTurn promise; the
sandbox's next teardown reaps the child. The 'instant SIGTERM on
timeout' behaviour T9 added is nice-to-have but not load-bearing
given the sandbox-level guarantees.

Net: −41 lines (types + bridge + handler). 27/29 unit tests pass
(the 2 failures are pre-existing handler-resume flakes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DRY pass on the post-review additions:

shellQuote was duplicated in claude.ts, codex.ts, opencode.ts (3
copies of the same 3-line function). Extracted to agents/shell-quote.ts
so a future fix to the quoting algorithm lands in one place.

isInFlight was duplicated in processStop, processConvertTarget,
processConvertKind, and the fork-source quiescence guard (4 copies of
the same status === running || starting || stopping check). Extracted
to a top-level helper in handler.ts. Also makes the intent more
self-documenting at call sites.

Net: −20 lines, single source of truth for two cross-cutting concepts.
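
The extracted status helper might look like this sketch. The status union is an assumption based on the statuses named in this PR, and `forkRejection` is a hypothetical caller showing the fork-source quiescence guard; the detail string follows the one quoted in the fork commit.

```typescript
// Assumed status set; the real enum lives in the entity schema.
type AgentStatus = "cold" | "starting" | "idle" | "running" | "stopping" | "error" | "destroyed";

function isInFlight(status: AgentStatus): boolean {
  return status === "running" || status === "starting" || status === "stopping";
}

// Returns the rejection detail for a fork, or undefined if the source is quiescent.
function forkRejection(sourceStatus: AgentStatus): string | undefined {
  return isInFlight(sourceStatus)
    ? `fork rejected: source not quiescent (status=${sourceStatus})`
    : undefined;
}
```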

Verified: all unit tests pass (163/171, with 2 pre-existing
handler-resume flakes unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin wrapper skill for the electric-ax-import CLI: detects the current
workspace + session id from the active Claude Code session, then runs
the import against a running electric-agents server. The session shows
up as a coding-agent entity in the UI (observable / forkable).

Lives at packages/coding-agents/claude-skills/electric-import/SKILL.md;
install with cp -R into ~/.claude/skills/. A README.md, added to both the
package root and claude-skills/, documents the install steps and the
trigger phrases.

A note in the package README clarifies the supported scope: importing
makes the session observable, NOT injectable. Claude Code has no
third-party API for pushing user-messages into a running interactive
session — see the research summary in the May 2026 session notes.
Codex / opencode equivalents documented as out-of-scope-for-now in the
skill's own 'Out of scope' section.

Verified locally: `cp -R` into ~/.claude/skills/ and the skill
registers in the available-skills list at session start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: imported agent on target='host' showed 'Sandbox starting'
in the timeline, even though no sandbox is actually booting — host
provider does a quick stat() on the workspace and is otherwise an
attach. The misleading lifecycle event fires every time the agent
warms back up from cold (idle eviction → next prompt).

Fix: suppress the sandbox.starting / sandbox.started lifecycle inserts
when target='host'. Status transitions through 'starting' are
preserved (state-machine consistency). sandbox.failed stays — host
attach can still fail (workspace not a directory) and the failure is
meaningful.
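
The gating rule can be sketched as a small predicate. `shouldRecordLifecycle` is a hypothetical name used for illustration; the actual change lives inline in the handler's cold-boot block:

```typescript
type Target = 'sandbox' | 'host' | 'sprites'
type SandboxLifecycleEvent =
  | 'sandbox.starting'
  | 'sandbox.started'
  | 'sandbox.failed'

// Hypothetical helper: decides whether a lifecycle row should be inserted.
function shouldRecordLifecycle(
  target: Target,
  event: SandboxLifecycleEvent
): boolean {
  // Non-host targets really do boot something, so every event is kept.
  if (target !== 'host') return true
  // Host attach is near-instant; only a failure is worth surfacing.
  return event === 'sandbox.failed'
}
```

The status field still moves through 'starting' regardless of what this predicate returns; only the timeline rows are filtered.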

L2.2 conformance still passes ('warm second prompt' asserts
not.toContain('sandbox.starting'); the new behaviour is a strict
superset).

Persisted lifecycle rows on existing agents aren't retroactively
cleaned. Fix only affects future cold-boots after the dev server
picks up the rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ks render line-per-line

Two bugs surfaced by inspecting an imported claude→codex fork.

1. Tool calls dropped (coding-agents):
   codex exec --json emits shell invocations as
     {type:'item.completed', item:{type:'command_execution', command,
      aggregated_output, exit_code}}
   The patched agent-session-protocol@0.0.2 only handles function_call /
   function_call_output items — command_execution is silently dropped.
   Result: every shell call codex made was invisible in the timeline.

   Fix: a small pre-pass (codex-command-shim.ts) expands each
   command_execution item into a function_call + function_call_output
   pair on the wire so asp's existing matchers fire. Order preserved
   (call before output, both share the item id so asp pairs them).
   Cheap and self-contained — no upstream patch maintenance.

2. Assistant code-block lines rendered as one mashed string (UI):
   Streamdown wraps each source line of a fenced code block in a
   <span class='block ...'>. styles.css already has a rule that
   forces those spans to display:block, but the rule is scoped to
   .agent-ui-markdown and AssistantMessageRow forgot the
   className. Result: spans stayed display:inline and 69 lines of a
   tree listing rendered as one mashed string.

   Fix: add className='agent-ui-markdown' to AssistantMessageRow's
   wrapper. Mirrors AgentResponse.tsx (Horton's renderer, which
   already had it).
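
For the first fix, the expansion performed by the pre-pass might look roughly like the sketch below. The input shape follows the codex event quoted above; the fields on the emitted items (`name`, `arguments`, `output`) are assumptions, since the commit does not show what agent-session-protocol matches on:

```typescript
// Sketch only: emitted field names are assumed, not taken from asp.
interface CodexItem {
  id: string
  type: string
  [field: string]: unknown
}

interface CodexEvent {
  type: string
  item?: CodexItem
}

function expandCommandExecution(event: CodexEvent): CodexEvent[] {
  const item = event.item
  if (
    event.type !== 'item.completed' ||
    item === undefined ||
    item.type !== 'command_execution'
  ) {
    return [event] // every other event passes through untouched
  }

  // Both halves reuse the original item id so the consumer can pair them.
  const call: CodexEvent = {
    type: 'item.completed',
    item: {
      id: item.id,
      type: 'function_call',
      name: 'shell',
      arguments: JSON.stringify({ command: item.command }),
    },
  }
  const output: CodexEvent = {
    type: 'item.completed',
    item: {
      id: item.id,
      type: 'function_call_output',
      output: item.aggregated_output,
      exit_code: item.exit_code,
    },
  }
  return [call, output] // call first, then output, preserving order
}
```

Running every parsed line through a function like this before handing it to the normaliser keeps the workaround in one file.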

Verified: typecheck clean, 163/171 unit tests pass (2 pre-existing
handler-resume flakes). Send a fresh codex prompt to see both fixes
land — existing events rows aren't retroactively rewritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude prefers ANTHROPIC_API_KEY over CLAUDE_CODE_OAUTH_TOKEN when both
are present, treating the value verbatim as a plain API key. When the
host's ANTHROPIC_API_KEY actually contains an OAuth subscription token
(`sk-ant-oat...` — common when the dev shell inherited it from
claude.ai's keychain bridge), the previous register.ts mirrored it into
CLAUDE_CODE_OAUTH_TOKEN AND left ANTHROPIC_API_KEY in the forwarded
env. Inside the sandbox (which has no keychain fallback), claude picks
ANTHROPIC_API_KEY, hits the API as a plain key, and every turn fails
with "Invalid API key" -> exit 1, stderr empty (the JSON error lands on
stdout).

Symptom in stream: `cli-exit:claude CLI exited 1. stderr=<empty>` for
both bindMount and volume workspace types — workspaceType was a red
herring; the failure is auth-shape-only.

Fix: when ANTHROPIC_API_KEY starts with `sk-ant-oat`, promote to
CLAUDE_CODE_OAUTH_TOKEN and delete ANTHROPIC_API_KEY before forwarding.
Also extract the supplier as `defaultEnvSupplier` so it can be tested
directly with an injected env source.
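
A minimal sketch of the promotion rule with an injected env source. `normalizeClaudeAuth` is a hypothetical name (the commit calls the extracted supplier `defaultEnvSupplier` but does not show its signature), and preferring an already-set CLAUDE_CODE_OAUTH_TOKEN over the promoted value is an assumption:

```typescript
type EnvSource = Record<string, string | undefined>

const OAUTH_PREFIX = 'sk-ant-oat'

// Hypothetical stand-in for the supplier: returns only the auth variables
// that should be forwarded into the sandbox.
function normalizeClaudeAuth(source: EnvSource): Record<string, string> {
  const forwarded: Record<string, string> = {}
  const apiKey = source.ANTHROPIC_API_KEY
  const oauthToken = source.CLAUDE_CODE_OAUTH_TOKEN

  if (apiKey !== undefined && apiKey.startsWith(OAUTH_PREFIX)) {
    // The "API key" is really an OAuth token: promote it and drop the
    // API-key variable so the CLI cannot pick it up as a plain key.
    forwarded.CLAUDE_CODE_OAUTH_TOKEN = oauthToken ?? apiKey
    return forwarded
  }

  if (apiKey !== undefined) forwarded.ANTHROPIC_API_KEY = apiKey
  if (oauthToken !== undefined) forwarded.CLAUDE_CODE_OAUTH_TOKEN = oauthToken
  return forwarded
}
```

Taking the env as a parameter instead of reading the process environment directly is what makes the rule testable without mutating global state.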

Verified: spawning a fresh claude/sandbox/bindMount agent and a fresh
claude/sandbox/volume agent both complete the turn successfully
("OK" assistant text), where on `main` they both fail identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arget=host

Adds regression coverage for commit b26d41e. The fix in handler.ts'
processPrompt cold-boot block (skipping sandbox.starting/sandbox.started
lifecycle inserts when meta.target === 'host') was correct, but had no
unit test pinning it. Reproduced live: a freshly spawned host-target
coding-agent on the running dev server still showed 'Sandbox starting'
in its timeline because the running start-builtin process predated the
dist rebuild and was holding the pre-fix module in memory. After
restarting the handler, host agents emit zero sandbox.* lifecycle rows
end-to-end (verified: PUT spawn → POST prompt → curl /main → empty
lifecycle collection).

Two unit tests added in entity-handler.test.ts:
1. cold → starting → idle on host: status transitions through 'starting'
   for state-machine consistency, but neither sandbox.starting nor
   sandbox.started ends up in the lifecycle collection.
2. error → cold → starting → idle on host (re-prompt after a prior CLI
   exit): the error fall-through resets to 'cold' and re-runs the
   cold-boot block, so the host suppression must hold there too.

Both tests fail with the host gates removed (verified locally) and pass
with them in place. sandbox.failed is intentionally untouched — host
attach can fail (e.g. workspace not a directory) and that's meaningful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or sprites

Surfaced by R2's investigation of agent FYuEorn_F7. Codex inside our
Docker sandbox container can't run any shell command on macOS Docker
Desktop — codex's inner bwrap-based command sandbox fails with
'bwrap: No permissions to create a new namespace, likely because the
kernel does not allow non-privileged user namespaces.' Result: every
shell tool call dies, codex silently produces no useful output, and
the user sees an agent that does nothing.

Codex 0.128 ships --dangerously-bypass-approvals-and-sandbox documented
as 'intended solely for running in environments that are externally
sandboxed.' That's exactly target=sandbox (Docker container) and
target=sprites (sprite is the workspace and the isolation boundary).
For target=host we leave codex's normal sandbox active — no outer
isolation, codex's bwrap layer is the only one.

Threaded `target` through:
  RunTurnArgs -> stdio-bridge -> CodingAgentAdapter.buildCliInvocation

claude / opencode adapters ignore the new field. codex uses it. The
existing argv-stability snapshot doesn't change (target wasn't part
of the input shapes covered).
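
The target-dependent part of the codex invocation can be sketched as below. `codexIsolationArgs` is a hypothetical helper name, and treating an undefined target like host (inner sandbox left on) is an assumption about the safe default:

```typescript
type Target = 'sandbox' | 'host' | 'sprites'

// Hypothetical helper: extra codex flags derived from the sandbox target.
function codexIsolationArgs(target: Target | undefined): string[] {
  // Docker containers and sprites already provide the isolation boundary,
  // so codex's inner bwrap sandbox is bypassed there.
  if (target === 'sandbox' || target === 'sprites') {
    return ['--dangerously-bypass-approvals-and-sandbox']
  }
  // host (or unknown): keep codex's own sandbox, it is the only layer.
  return []
}
```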

Tests: 4 new direct assertions in adapter-argv.test.ts covering
sandbox / sprites / host / undefined target. 16/16 pass on the contract
suite. Full unit run: 171 passed (the 2 failing handler-resume tests
are pre-existing and unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>