Skip to content

feat: remote image builder for the TypeScript SDK with SIGINT cancellation and unified retry policy#78

Merged
mohamedveron merged 3 commits into
mainfrom
CSB-1454
May 22, 2026
Merged

feat: remote image builder for the TypeScript SDK with SIGINT cancellation and unified retry policy#78
mohamedveron merged 3 commits into
mainfrom
CSB-1454

Conversation

@christianalfoni
Copy link
Copy Markdown
Collaborator

@christianalfoni christianalfoni commented May 18, 2026

Optimize image-based snapshots through the remote builder

❌ Current behavior

sdk.snapshots.create({ image: "node:22" })
  └─> api.createSnapshot({ image: "node:22", architecture: "amd64" })
      └─> just registers the public reference — no build, no optimization
      └─> cold-start pulls the raw image from its source registry
      └─> slow first-boot, no nydus lazy-pull

✅ New behavior

sdk.snapshots.create({ image: "node:22" })
  └─> mkdtemp() → /tmp/together-sandbox-XXXX/node/
  └─> write Dockerfile: "FROM node:22"
  └─> _buildRemotely(contextDir)
        └─> POST /builds to image-builder service
        └─> stream SSE build logs
        └─> nydus-convert the result (lazy-pull layers)
        └─> push to internal registry
  └─> api.createSnapshot({ image: <built-ref>, architecture })
  └─> rm -rf temp dir (finally)
  └─> fast cold-starts via nydus lazy-pull on subsequent sandbox boots

🤔 Assumptions

  • The remote image builder accepts a one-line FROM <image> Dockerfile and produces a nydus-optimized image in the internal registry — no special-case API on the builder side needed.
  • Nydus-optimized images in the internal registry give meaningfully faster cold-starts than re-pulling raw public images on every sandbox boot.
  • A meaningful slug derived from the image name (e.g. ghcr.io/foo/bar:latestbar) is preferable to a random UUID for the build's image name, since the existing _buildRemotely derives the name from path.basename(contextDir).

🧠 Decisions

  • Reuse _buildRemotely rather than adding a new builder code path: write a synthetic FROM <image> Dockerfile into a temp folder and delegate. Keeps one remote-build implementation; nydus conversion happens for free because _buildRemotely already passes nydus: true.
  • mkdtemp + finally cleanup so the temp directory is removed regardless of build outcome (success, failure, Ctrl+C).
  • Slug derivation from the image's last path component, stripped of tag/digest, lowercased, _-. Falls back to "image" for pathological inputs that strip to empty.

🧪 Testing

  • TypeScript LSP reports no diagnostics on changed files.
  • End-to-end remote build flow tested locally

📁 References

  • together-sandbox-typescript/src/Snapshots.ts
  • together-sandbox-typescript/src/RemoteImageBuilder.ts

Tear down server-side builds on Ctrl+C

❌ Current behavior

Ctrl+C during a remote build:
  TS:     no SIGINT handler → process dies, build keeps running server-side
  Python: cancel() runs, but a second Ctrl+C can abort the DELETE
         → orphaned build pod left running

✅ New behavior

Ctrl+C during a remote build:
  TS:     process.once("SIGINT", () => cancel(buildId) → exit(130))
  Python: await asyncio.shield(self.cancel(build_id))   ← survives 2nd Ctrl+C
         → server tears down the build pod, no orphans
sequenceDiagram
  participant U as User
  participant SDK as SDK
  participant API as Build API
  participant Pod as Build Pod

  U->>SDK: snapshots.create({ image })
  SDK->>API: POST /builds (FROM <image>)
  API-->>SDK: { build_id }
  SDK->>API: GET /builds/{id}/logs (SSE)
  Note over U,SDK: Ctrl+C →
  U->>SDK: SIGINT
  SDK->>API: DELETE /builds/{id}
  API->>Pod: tear down
Loading

🤔 Assumptions

  • Node does not propagate SIGINT into pending async code, so an explicit process.once("SIGINT", ...) handler is required to issue the DELETE before exiting with status 130.
  • A second Ctrl+C while the DELETE is in flight should not abort it.

🧠 Decisions

  • once + removeListener in TS so repeated build() calls in the same process (tests, daemons) don't accumulate stale SIGINT handlers.
  • asyncio.shield(self.cancel(...)) in Python guards against a second Ctrl+C aborting the DELETE mid-flight and leaving an orphaned pod.

🧪 Testing

  • Not tested end-to-end. Behavior verified by inspection against Node/asyncio cancellation semantics.

📁 References

  • together-sandbox-typescript/src/RemoteImageBuilder.ts
  • together-sandbox-python/together_sandbox/_remote_image_builder.py

Faster, consistent retries when streaming remote build logs

❌ Current behavior

SSE log stream retries (before):
  - Python: only 404 retried, 5xx raised immediately
  - TS:     regex /\b404\b/ on error message — brittle
  - Budget: 60 attempts × 5s wait  →  up to ~5 min before giving up

✅ New behavior

SSE log stream retries (after):
  - Both SDKs retry on RETRYABLE_STATUS_CODES ∪ {404}
      → 408, 429, 500, 502, 503, 504, 404
  - 404 = build pod hasn't materialised yet
  - Any other status code throws on attempt 1
  - Transport errors (timeout / DNS / TLS / EOF) → existing retry path
  - Budget: 5 attempts × 1s wait  →  ~5s before giving up

🤔 Assumptions

  • 5xx on the SSE log endpoint indicates a transient gateway/infra issue, not that the build itself died — the build continues server-side and a reconnect picks up where the stream left off.
  • ~5 seconds is enough headroom for transient pod-scheduling and gateway flakes; longer waits hurt UX more than they help reliability.
  • The eventsource npm library sets evt.code only for HTTP responses; transport failures leave code undefined and already flow through the resolve-with-sawDone=false retry path.

🧠 Decisions

  • Immutable retryable set: new Set([...RETRYABLE_STATUS_CODES, 404]) instead of mutating the shared export, so the 404-retry stays scoped to the SSE stream and doesn't leak into other callApi consumers.
  • Strict status parsing in the catch block: unknown-status errors (e.g. EventSource constructor failures from malformed URLs) throw immediately rather than burn the retry budget on programmer errors.
  • Match Python's branch structure: except HTTPStatusError covers HTTP retries; except Exception already covers transport errors — TS reaches the same outcome via its two-path topology.

🔄 Discussions

  • Considered loosening the TS predicate to retry on unknown status codes for stricter Python except Exception: parity. Verified via the eventsource library source that transport failures always take the resolve(sawDone=false) path, so the strict predicate is correct.
  • Considered RETRYABLE_STATUS_CODES.add(404) to share the set, but that would silently extend retry behavior for every other callApi consumer — chose a local spread-constructed set instead.

🧪 Testing

  • TypeScript LSP reports no diagnostics on changed files.
  • Behavior cross-checked against the Python implementation for matrix of: 404 / 5xx / 4xx / timeout / fatal SSE event / stream EOF — both stacks now reach the same retry-vs-throw outcome.

📁 References

  • together-sandbox-typescript/src/RemoteImageBuilder.ts
  • together-sandbox-typescript/src/utils.ts
  • together-sandbox-python/together_sandbox/_remote_image_builder.py
  • together-sandbox-python/together_sandbox/_utils.py

@christianalfoni christianalfoni changed the title feat: use same retry status codes for remote build and reduce retry time and tries feat: remote image builder for the TypeScript SDK with SIGINT cancellation and unified retry policy May 18, 2026
@mohamedveron mohamedveron merged commit 50a3fb9 into main May 22, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants