feat(0.15.0): cross-worker sandbox-reconnect durability #23

Open
drewstone wants to merge 1 commit into main from feat/sandbox-reconnect


Summary

A 15-minute agentic sandbox turn must survive the Cloudflare worker isolate dying mid-turn (deploy roll, CPU limit, OOM). runDurableTurn already replays a completed turn, but an interrupted turn re-runs from the top — the producer's streamPrompt generator died with the isolate.

The Tangle sandbox container is orchestrator-managed and outlives the worker. This PR adds runReconnectableTurn: it checkpoints a RunHandle at turn start so a fresh worker re-attaches to the in-flight sandbox run instead of re-prompting.

  • RunHandle{ kind: 'sandbox' | 'tcloud', sandboxId?, sessionId?, runId?, status, cursor? }. A pointer to a substrate run that outlives the isolate.
  • runReconnectableTurn — three resolution paths on a retry: replayed (turn already finished — cached text replays), reconnected (a running handle survived — calls the product's reconnect(handle) callback), rerun/fresh (no reconnectable handle — produces live).
  • reconnect(handle) is product-supplied substrate glue. Sandbox products wire the SDK's event-replay endpoint (GET {runtimeUrl}/agents/run/{runId}/events?lastEventId={cursor}); tcloud products omit it and fall through to a clean re-run.
  • Storage: the handle is checkpointed as a completed step at index 0; the turn runs at index 1. Reuses the existing completeStep JSON-result path with zero schema change — a completed step is the only shape startOrResume returns to a retry, and the handle must be readable while the turn step is still running. A new durable_steps column would force a migration across all three stores plus a new store method.
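The three resolution paths listed above can be sketched as a small decision function. This is a minimal illustration, not the PR's implementation: the `RunHandle` fields follow the summary, but the `status` values other than `running`, the `Resolution` labels, and the `resolveTurn` helper with its parameters are assumptions made for the example.

```typescript
// Hypothetical shape: field names follow the PR summary; the status union is assumed.
type RunHandle = {
  kind: 'sandbox' | 'tcloud';
  sandboxId?: string;
  sessionId?: string;
  runId?: string;
  status: 'running' | 'completed';
  cursor?: string;
};

// Labels mirror the summary; 'rerun' and 'fresh' are split here only for clarity.
type Resolution = 'replayed' | 'reconnected' | 'rerun' | 'fresh';

// resolveTurn is an illustrative helper, not an agent-runtime export.
function resolveTurn(
  cachedText: string | undefined, // set when the turn step already completed
  handle: RunHandle | undefined,  // set when a handle was checkpointed earlier
  canReconnect: boolean,          // true when the product supplied reconnect(handle)
): Resolution {
  // A finished turn replays its cached text regardless of the handle.
  if (cachedText !== undefined) return 'replayed';
  // No handle at all: this is the first attempt, so produce live.
  if (handle === undefined) return 'fresh';
  // A surviving running handle plus a reconnect callback: re-attach.
  if (handle.status === 'running' && canReconnect) return 'reconnected';
  // Otherwise (e.g. tcloud products that omit reconnect) fall through to a clean re-run.
  return 'rerun';
}
```

Under these assumptions, a tcloud product never passes `canReconnect = true`, so an interrupted turn always resolves to `rerun`, matching the fall-through behaviour described in the bullets.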

This is a thin handle registry, not a second durable-execution framework — the sandbox runtime is the durable engine; agent-runtime just remembers the pointer.
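To make the storage layout concrete (handle as a completed step at index 0, turn at index 1), here is a toy in-memory sketch. `ToyStepStore`, `registerHandle`, and `loadHandle` are hypothetical names for illustration only; the real store interface and the signatures of `completeStep` and `startOrResume` are not shown in this PR description and are not reproduced here.

```typescript
// Toy stand-in for a durable step store; NOT the real agent-runtime store interface.
type StepRecord = { status: 'running' | 'completed'; resultJson?: string };

// Minimal handle shape for the example; fields follow the PR summary.
type RunHandle = {
  kind: 'sandbox' | 'tcloud';
  runId?: string;
  status: 'running' | 'completed';
  cursor?: string;
};

class ToyStepStore {
  private steps = new Map<string, StepRecord>();

  private key(turnKey: string, index: number): string {
    return `${turnKey}:${index}`;
  }

  // Marks a step as in flight; it has no readable result yet.
  startStep(turnKey: string, index: number): void {
    this.steps.set(this.key(turnKey, index), { status: 'running' });
  }

  // Stores a JSON result and marks the step completed (overwrites earlier results).
  completeStep(turnKey: string, index: number, result: unknown): void {
    this.steps.set(this.key(turnKey, index), {
      status: 'completed',
      resultJson: JSON.stringify(result),
    });
  }

  // Mirrors the constraint in the text: a retry can only read completed steps.
  readCompleted<T>(turnKey: string, index: number): T | undefined {
    const rec = this.steps.get(this.key(turnKey, index));
    if (rec === undefined || rec.status !== 'completed' || rec.resultJson === undefined) {
      return undefined;
    }
    return JSON.parse(rec.resultJson) as T;
  }
}

const HANDLE_STEP = 0; // handle checkpoint lives here
const TURN_STEP = 1;   // the turn itself runs here

function registerHandle(store: ToyStepStore, turnKey: string, handle: RunHandle): void {
  // Writing the handle as a *completed* step keeps it readable while the turn is running.
  store.completeStep(turnKey, HANDLE_STEP, handle);
}

function loadHandle(store: ToyStepStore, turnKey: string): RunHandle | undefined {
  return store.readCompleted<RunHandle>(turnKey, HANDLE_STEP);
}
```

In this sketch, calling `registerHandle` again with a newer `cursor` simply overwrites step 0, which is one plausible way the "register advances the persisted cursor" behaviour from the test plan could work.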

Spike findings (@tangle-network/sandbox@0.1.2)

Cross-worker attach is feasible. streamPrompt's reconnect uses executionId (run id, carried on the execution.started SSE frame's data) + lastEventId (the SSE id: cursor). The runtime exposes GET {runtimeUrl}/agents/run/{executionId}/events?lastEventId={cursor}&format=sse, reachable from any process via the public SandboxConnection.runtimeUrl + authToken. The SDK does not expose a one-call resumeRun(executionId) — its reconnect loop is closure-local — so the raw replay fetch is product-owned, which is exactly why reconnect is a product-supplied callback.
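A product-owned `reconnect` callback would need to build the replay URL and track the SSE `id:` cursor as frames arrive. The sketch below covers only those two pure pieces. The URL template comes from the spike findings; `buildReplayUrl`, `parseSseFrames`, and `advanceCursor` are hypothetical helper names, and the JSON payload in the usage example is invented for illustration. How `authToken` is attached to the request is not stated in this description, so the fetch itself is left out.

```typescript
// Builds the event-replay URL described in the spike findings.
// buildReplayUrl is an illustrative helper, not part of the sandbox SDK.
function buildReplayUrl(runtimeUrl: string, executionId: string, cursor?: string): string {
  const base = runtimeUrl.replace(/\/+$/, ''); // drop trailing slashes
  const params: string[] = [];
  if (cursor !== undefined) params.push(`lastEventId=${encodeURIComponent(cursor)}`);
  params.push('format=sse');
  return `${base}/agents/run/${encodeURIComponent(executionId)}/events?${params.join('&')}`;
}

type SseFrame = { id: string | undefined; event: string | undefined; data: string };

// Parses a complete chunk of SSE text into frames (blank line = frame boundary).
// Handles only the id/event/data fields; comments and retry fields are ignored.
function parseSseFrames(text: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let id: string | undefined;
    let event: string | undefined;
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('id:')) id = line.slice(3).trim();
      else if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) data.push(line.slice(5).replace(/^ /, ''));
    }
    // Frames without data carry nothing to dispatch, so they are skipped here.
    if (data.length > 0) frames.push({ id, event, data: data.join('\n') });
  }
  return frames;
}

// Returns the newest frame id seen, or the previous cursor if no frame had one.
function advanceCursor(previous: string | undefined, frames: SseFrame[]): string | undefined {
  let cursor = previous;
  for (const frame of frames) {
    if (frame.id !== undefined) cursor = frame.id;
  }
  return cursor;
}
```

A usage sketch, with a made-up host and payload:

```typescript
const url = buildReplayUrl('https://runtime.example/', 'exec-1', '42');
const frames = parseSseFrames('id: 43\nevent: execution.started\ndata: {"executionId":"exec-1"}\n\n');
const next = advanceCursor('42', frames); // cursor to persist in the RunHandle
```

Persisting `next` back into the handle after each batch is what would let yet another worker resume from the right event if this one also dies.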

Test plan

  • pnpm typecheck passes
  • pnpm test — 231/231 pass (18 new in run-handle.test.ts)
  • New tests run across the InMemory / FileSystem / D1-over-sqlite store matrix
  • Covered: fresh turn registers a handle; retry with a running handle calls reconnect not produce; completed handle replays; running handle with no reconnect falls through to re-run; reconnect-stream failure fails the run (error not swallowed); register advances the persisted cursor
  • Confirmed the FileSystem-store concurrent-write race fix (handle writes drained before the turn-step write) is stable across 5 consecutive runs

A 15-minute agentic sandbox turn must survive the Cloudflare worker
isolate dying mid-turn. `runDurableTurn` already replays a *completed*
turn, but an *interrupted* one re-runs from the top — the producer's
`streamPrompt` generator died with the isolate.

The sandbox container is orchestrator-managed and outlives the worker.
`runReconnectableTurn` checkpoints a `RunHandle` — `{ kind, sandboxId,
sessionId, runId, status, cursor }` — at turn start. On a retry that
finds a `running` handle, a fresh worker calls a product-supplied
`reconnect(handle)` callback (which wires the sandbox SDK's event-replay
endpoint) instead of re-prompting. tcloud products omit `reconnect` and
fall through to a clean re-run.

The handle is checkpointed as a completed step at index 0; the turn runs
at index 1. This reuses the existing `completeStep` JSON-result path
with zero schema change — a completed step is the only shape
`startOrResume` returns to a retry, and the handle must be readable
while the turn step is still `running`.

Tests cover fresh / reconnected / replayed / rerun / reconnect-failure
across the InMemory / FileSystem / D1 store matrix.