feat(studio): add Stop run button + graceful CLI interrupt — pairs with eval resume #1222

@christso

Objective

Add a "Stop run" affordance — both a UI button on /jobs/:runId and graceful CLI signal handling — so users can interrupt a long-running eval without orphaning the subprocess or losing partial results. Today there is no programmatic stop:

  • Studio: the launch endpoint stores process: ChildProcess per run but exposes no DELETE/stop route. Closing the browser tab leaves the CLI subprocess running until it completes naturally.
  • CLI: no top-level SIGINT/SIGTERM handler. Ctrl+C hard-kills the eval mid-test. The only child.kill() calls in the codebase live inside agent providers (claude-cli, codex-cli, pi-cli) terminating their own per-test subprocess on timeout — not the orchestrator handling user interrupt.

This pairs naturally with the resume feature shipped in #1220: today the workflow for "I want to bail on this run" is kill the terminal → resume in Studio. With a Stop button it becomes click Stop → click Resume, all without leaving the browser.

Current state — what already works

  • Per-test results are flushed row-by-row into index.jsonl as tests complete, so any partial state is durable on disk and resumable. The "stop" feature does not need to invent persistence — only graceful termination.
  • eval-runner.ts already retains a process: ChildProcess reference per Studio-launched run, so the server can call run.process.kill('SIGTERM') on that child once an endpoint is added.

Gap discovered: incomplete partial runs are not surfaced as resumable

The current Studio Resume affordance from #1220 is keyed to runs that contain at least one execution_status: execution_error row.

That misses an important resumability case: a run can be interrupted after writing some successful rows and before executing the remaining planned tests, leaving a partial run that is still resumable in principle but has no execution_error rows.

Real repro:

  1. Start a multi-test eval.
  2. Let a few tests complete successfully.
  3. Kill the run before the remaining tests execute.
  4. Open the run detail page in Studio.

Observed:

  • The run contains only the completed ok rows.
  • There are no execution_error rows.
  • Studio shows only "Re-run with Filters".
  • No "Resume run" button is rendered.

Expected:

  • A partial run should be resumable when it is incomplete relative to the originally planned suite/test set, even if all recorded rows are currently ok.

This matters because resumability should support "continue later" workflows, not only "recover from execution errors."

Proposed changes

1. CLI signal handler

Register SIGINT / SIGTERM handlers at the top of apps/cli/src/commands/eval/run-eval.ts (or wherever the orchestrator entry point lives):

  • On first signal: set a stopRequested flag, allow in-flight tests to finish (they're already isolated), then exit cleanly with a non-zero code distinguishable from "crashed."
  • On second signal: hard exit (so users can still escape if a test is hung).
  • Print a concise message: Stop requested — waiting for N in-flight test(s) to finish (Ctrl+C again to force-quit).
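
The two-stage behaviour above could be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: createStopController, runUntilStopped, installStopHandlers, EXIT_STOPPED and EXIT_FORCED are all hypothetical names, the exit code values are placeholders, and the sequential loop stands in for the real worker pool.

```typescript
// Hypothetical exit codes; the issue only asks that "stopped" differ from "crashed".
const EXIT_STOPPED = 130; // 128 + SIGINT (2), the value shells report for an interrupted process
const EXIT_FORCED = 1; // placeholder for the force-quit path

interface StopController {
  readonly stopRequested: boolean;
  onSignal(): void;
}

function createStopController(
  inFlightCount: () => number,
  exit: (code: number) => void,
  log: (message: string) => void,
): StopController {
  let signalCount = 0;
  return {
    get stopRequested(): boolean {
      return signalCount > 0;
    },
    onSignal(): void {
      signalCount += 1;
      if (signalCount === 1) {
        // First signal: only raise the flag; in-flight tests keep running.
        log(
          `Stop requested: waiting for ${inFlightCount()} in-flight test(s) to finish (Ctrl+C again to force-quit).`,
        );
        return;
      }
      // Second or later signal: escape hatch for a hung test.
      exit(EXIT_FORCED);
    },
  };
}

// Sequential stand-in for the worker pool: the flag is checked before each
// test starts, so a test that is already running is allowed to complete.
async function runUntilStopped(
  tests: Array<() => Promise<void>>,
  stop: StopController,
): Promise<number> {
  let completed = 0;
  for (const test of tests) {
    if (stop.stopRequested) break;
    await test();
    completed += 1;
  }
  return completed;
}

// Minimal emitter shape so the sketch does not depend on Node typings.
interface SignalSource {
  on(event: string, listener: () => void): unknown;
}

function installStopHandlers(source: SignalSource, stop: StopController): void {
  source.on("SIGINT", () => stop.onSignal());
  source.on("SIGTERM", () => stop.onSignal());
}
```

In the entry point this would be wired as installStopHandlers(process, controller); once runUntilStopped returns with controller.stopRequested set, the command would exit with EXIT_STOPPED. Note that in Node.js, installing a SIGINT or SIGTERM listener removes the default exit behaviour, so the handler itself has to guarantee the process eventually exits.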

2. Studio API: DELETE /api/eval/run/:id

Add a route that:

  • 404s if the run id is unknown.
  • 403s in read-only mode (matches the existing guard on POST).
  • 409s (or 200 with {stopped: false}) if the run is already terminal.
  • Otherwise calls run.process?.kill('SIGTERM'), sets run.status = 'stopping', returns 202.

The existing child.on('close') handler will flip the status to failed/finished when the CLI exits.

Add a benchmark-scoped variant, DELETE /api/benchmarks/:benchmarkId/eval/run/:id, matching the existing pattern.
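
The status-code decisions for the stop route could be isolated in a pure function along these lines. This is a sketch under assumptions: stopRun, StudioRun, KillableProcess and StopResult are hypothetical names, the check order follows the bullet list above rather than the existing POST guard, and treating a repeated request on a run that is already 'stopping' as an idempotent 202 is a design choice the issue does not specify.

```typescript
type RunStatus = "starting" | "running" | "stopping" | "finished" | "failed";

// Subset of Node's ChildProcess that this sketch relies on.
interface KillableProcess {
  kill(signal?: "SIGTERM" | "SIGKILL"): boolean;
}

interface StudioRun {
  status: RunStatus;
  process?: KillableProcess;
}

interface StopResult {
  httpStatus: 202 | 403 | 404 | 409;
  message: string;
}

function stopRun(
  runs: Map<string, StudioRun>,
  runId: string,
  readOnly: boolean,
): StopResult {
  const run = runs.get(runId);
  if (!run) {
    return { httpStatus: 404, message: "unknown run id" };
  }
  if (readOnly) {
    return { httpStatus: 403, message: "read-only mode" };
  }
  if (run.status === "finished" || run.status === "failed") {
    return { httpStatus: 409, message: "run is already terminal" };
  }
  if (run.status !== "stopping") {
    // Ask the CLI child to shut down gracefully; the existing close
    // handler is expected to move the run to its terminal status later.
    run.process?.kill("SIGTERM");
    run.status = "stopping";
  }
  return { httpStatus: 202, message: "stop requested" };
}
```

Keeping the decision logic free of the HTTP framework would let both the plain and the benchmark-scoped route call the same function and makes the 404/403/409/202 cases straightforward to unit test with a fake child.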

3. UI: "Stop run" button on /jobs/:runId

In apps/studio/src/routes/jobs/$runId.tsx:

  • Render a destructive-style button (red outline) when status is 'starting' or 'running' and Studio is not in read-only mode.
  • On click: DELETE /api/eval/run/:id, optimistic-flip the status indicator to "Stopping…".
  • After the run hits a terminal state, the existing UI already updates correctly.
  • Hide the button in read-only mode (UI-level guard; the API also returns 403).
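
The render condition above is small enough to live in a pure helper. A possible shape, using the shouldShowStopButton(status, isReadOnly) name suggested in the Tests section; the JobStatus union is an assumption assembled from the statuses mentioned in this issue and may not match the real type.

```typescript
// Assumed status union; the actual Studio type may differ.
type JobStatus = "starting" | "running" | "stopping" | "finished" | "failed";

function shouldShowStopButton(status: JobStatus, isReadOnly: boolean): boolean {
  // Only live runs can be stopped, and never from a read-only Studio.
  return !isReadOnly && (status === "starting" || status === "running");
}
```

With this shape the button disappears as soon as the optimistic "Stopping…" state is applied, which also prevents double clicks from sending a second DELETE.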

4. Resume metadata for incomplete partial runs

Tighten the run-detail resumability contract so the UI does not infer resumability solely from execution_error rows.

Possible shape:

  • compute resumability from run completeness relative to the planned suite/test set recorded in benchmark.json / launch metadata
  • surface explicit fields like is_resumable and resume_reason from the run-detail API
  • continue to support the existing execution-error case, but also treat truncated partial runs as resumable
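
One way the resumability computation might look, as a sketch only. The is_resumable and resume_reason field names come from the bullets above; computeResumability, the test_id row field, and the "execution_error" / "incomplete" reason values are hypothetical, and the planned test ids are assumed to be available from the launch metadata.

```typescript
// Assumed row shape; only execution_status is named in this issue.
interface ResultRow {
  test_id: string; // hypothetical identifier field
  execution_status: string; // e.g. "ok" or "execution_error"
}

interface Resumability {
  is_resumable: boolean;
  resume_reason: "execution_error" | "incomplete" | null;
}

function computeResumability(
  plannedTestIds: string[],
  rows: ResultRow[],
): Resumability {
  // Existing case from #1220: at least one execution_error row.
  if (rows.some((row) => row.execution_status === "execution_error")) {
    return { is_resumable: true, resume_reason: "execution_error" };
  }
  // New case: some planned tests never produced a row at all.
  const recorded = new Set(rows.map((row) => row.test_id));
  if (plannedTestIds.some((id) => !recorded.has(id))) {
    return { is_resumable: true, resume_reason: "incomplete" };
  }
  return { is_resumable: false, resume_reason: null };
}
```

For a three-test plan with only two ok rows recorded, this reports is_resumable: true with reason "incomplete"; once all three rows are ok it reports is_resumable: false, which matches the expectation that a fully completed successful run does not show Resume.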

5. Tests

  • Server: in apps/cli/test/commands/results/serve.test.ts, add cases for unknown id (404), read-only (403), and a happy-path stop using a fake long-running child.
  • CLI: a small test that sends SIGINT to a multi-test eval run and asserts (a) exit code is the "stopped" sentinel and (b) index.jsonl contains the rows for tests completed before the signal.
  • UI: pure helper for "should the stop button render?" — shouldShowStopButton(status, isReadOnly).
  • Resume UI: tests covering both resumable states: execution_error rows and incomplete partial runs with only ok rows.

Acceptance signals

  • CLI: SIGINT during a multi-test eval produces a clean exit and a partial index.jsonl containing all tests completed before the signal.
  • CLI: a second SIGINT within 1s force-quits.
  • Server: DELETE /api/eval/run/:id exists and is 403-guarded in read-only mode; benchmark-scoped variant works the same.
  • UI: a "Stop run" button renders on /jobs/:runId while running, hidden when terminal, hidden in read-only.
  • UI: clicking Stop, then navigating to the originating /runs/:runId, shows the partial run, with the Resume run button from #1220 visible when either condition is true:
    • the run contains at least one execution_status: execution_error row, or
    • the run is incomplete relative to the originally planned suite/test set even though all recorded rows are ok.
  • UI: a fully completed successful run does not show Resume.
  • Manual red/green: red = on main, killing terminal mid-eval is the only way to stop; green = on this branch, the Stop button on /jobs/<id> terminates cleanly and the partial run is resumable in one click.

Non-goals

  • No "Pause" semantics. Stop fully terminates; resume is the way to continue.
  • No queue management. This is for one running job at a time — multi-job orchestration is out of scope.
  • No SIGINT-to-grader translation. If a grader is mid-flight when the signal arrives, let it finish or time out per existing rules.

Estimate

~half a day. CLI signal handling is the biggest unknown (need to thread the flag through the worker pool); the UI + API changes are small.
