Objective
Add a "Stop run" affordance — both a UI button on /jobs/:runId and graceful CLI signal handling — so users can interrupt a long-running eval without orphaning the subprocess or losing partial results. Today there is no programmatic stop:
- Studio: the launch endpoint stores process: ChildProcess per run but exposes no DELETE/stop route. Closing the browser tab leaves the CLI subprocess running until it completes naturally.
- CLI: there is no top-level SIGINT/SIGTERM handler, so Ctrl+C hard-kills the eval mid-test. The only child.kill() calls in the codebase live inside agent providers (claude-cli, codex-cli, pi-cli), which terminate their own per-test subprocess on timeout; none of them is the orchestrator handling a user interrupt.
This pairs naturally with the resume feature shipped in #1220: today the workflow for "I want to bail on this run" is kill the terminal → resume in Studio. With a Stop button it becomes click Stop → click Resume, all without leaving the browser.
Current state — what already works
- Per-test results are flushed row-by-row into index.jsonl as tests complete, so any partial state is durable on disk and resumable. The "stop" feature does not need to invent persistence, only graceful termination.
- eval-runner.ts already retains a process: ChildProcess reference per Studio-launched run, so the server can call process.kill('SIGTERM') on that reference once an endpoint is added.
Gap discovered: incomplete partial runs are not surfaced as resumable
The current Studio Resume affordance from #1220 is keyed to runs that contain at least one execution_status: execution_error row.
That misses an important resumability case: a run can be interrupted after writing some successful rows and before executing the remaining planned tests, leaving a partial run that is still resumable in principle but has no execution_error rows.
Real repro:
- Start a multi-test eval.
- Let a few tests complete successfully.
- Kill the run before the remaining tests execute.
- Open the run detail page in Studio.
Observed:
- The run contains only the completed ok rows.
- There are no execution_error rows.
- Studio shows only "Re-run with Filters".
- No "Resume run" button is rendered.
Expected:
- A partial run should be resumable when it is incomplete relative to the originally planned suite/test set, even if all recorded rows are currently ok.
This matters because resumability should support "continue later" workflows, not only "recover from execution errors."
Proposed changes
1. CLI signal handler
Register SIGINT / SIGTERM handlers at the top of apps/cli/src/commands/eval/run-eval.ts (or wherever the orchestrator entry point lives):
- On first signal: set a stopRequested flag, allow in-flight tests to finish (they're already isolated), then exit cleanly with a non-zero code distinguishable from "crashed."
- On second signal: hard exit (so users can still escape if a test is hung).
- Print a concise message:
Stop requested — waiting for N in-flight test(s) to finish (Ctrl+C again to force-quit).
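The two-stage behavior above can be sketched as a small controller that counts signals, kept separate from the process wiring so it can be unit-tested. This is a minimal sketch, not the actual implementation: createStopController, installStopHandlers, and the SignalSource shape are hypothetical names, and the real worker pool would need to consult stopRequested before picking up the next test.

```typescript
// Hypothetical helper: tracks how many stop signals have arrived.
type SignalAction = "graceful" | "force";

interface StopController {
  /** True once the first signal has been received. */
  readonly stopRequested: boolean;
  /** Record one signal and report how the caller should react. */
  handleSignal(): SignalAction;
}

function createStopController(): StopController {
  let signalCount = 0;
  return {
    get stopRequested(): boolean {
      return signalCount > 0;
    },
    handleSignal(): SignalAction {
      signalCount += 1;
      // First signal: drain in-flight tests. Any later signal: force-quit.
      return signalCount === 1 ? "graceful" : "force";
    },
  };
}

// Minimal shape of an event source such as Node's `process` (assumption:
// only `on` is needed here, so the wiring can be tested with a fake).
interface SignalSource {
  on(event: string, listener: () => void): unknown;
}

function installStopHandlers(
  source: SignalSource,
  controller: StopController,
  getInFlight: () => number,
  onGraceful: (inFlight: number) => void,
  onForce: () => void,
): void {
  for (const signal of ["SIGINT", "SIGTERM"]) {
    source.on(signal, () => {
      const inFlight = getInFlight();
      if (controller.handleSignal() === "force") {
        onForce();
      } else {
        onGraceful(inFlight);
      }
    });
  }
}
```

In the CLI, source would be Node's process object, onGraceful would print the "Stop requested" message with the in-flight count, and onForce would call process.exit with a non-zero code. The "stopped" exit code itself is left to the caller because the proposal does not fix its value.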
2. Studio API: DELETE /api/eval/run/:id
Add a route that:
- 404s if the run id is unknown.
- 403s in read-only mode (matches the existing guard on POST).
- 409s (or 200 with {stopped: false}) if the run is already terminal.
- Otherwise calls run.process?.kill('SIGTERM'), sets run.status = 'stopping', and returns 202.
The existing child.on('close') handler will flip the status to failed/finished when the CLI exits.
Add benchmark-scoped variant DELETE /api/benchmarks/:benchmarkId/eval/run/:id matching the existing pattern.
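The route's decision logic can be sketched as a framework-independent function, which also makes it easy to test with a fake child. This is a sketch under assumptions: stopRun, TrackedRun, and StopResponse are hypothetical names, the 409 branch is picked over the 200 {stopped: false} alternative, and a repeated request on an already-stopping run is treated as a no-op (so that a second SIGTERM does not trigger the CLI's force-quit path).

```typescript
type RunStatus = "starting" | "running" | "stopping" | "finished" | "failed";

// Only the part of ChildProcess the route needs.
interface Killable {
  kill(signal?: string): boolean;
}

interface TrackedRun {
  status: RunStatus;
  process?: Killable;
}

interface StopResponse {
  httpStatus: 202 | 403 | 404 | 409;
  body: { stopped: boolean; error?: string };
}

function stopRun(
  runs: Map<string, TrackedRun>,
  runId: string,
  readOnly: boolean,
): StopResponse {
  const run = runs.get(runId);
  if (!run) {
    return { httpStatus: 404, body: { stopped: false, error: "unknown run" } };
  }
  if (readOnly) {
    return { httpStatus: 403, body: { stopped: false, error: "read-only mode" } };
  }
  if (run.status === "finished" || run.status === "failed") {
    return { httpStatus: 409, body: { stopped: false, error: "run already terminal" } };
  }
  if (run.status === "stopping") {
    // Assumption: do not signal twice; the first SIGTERM is still draining.
    return { httpStatus: 202, body: { stopped: true } };
  }
  run.process?.kill("SIGTERM");
  run.status = "stopping";
  return { httpStatus: 202, body: { stopped: true } };
}
```

The benchmark-scoped variant could call the same function after resolving the benchmark's run table, so both routes share one set of status rules.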
3. UI: "Stop run" button on /jobs/:runId
In apps/studio/src/routes/jobs/$runId.tsx:
- Render a destructive-style button (red outline) when status === 'starting' or 'running' and not in read-only mode.
- On click: send DELETE /api/eval/run/:id and optimistically flip the status indicator to "Stopping…".
- After the run hits a terminal state, the existing UI already updates correctly.
- Hide the button in read-only mode (a UI-level guard; the API also returns 403).
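The render condition above reduces to a pure helper. The sketch below uses the shouldShowStopButton(status, isReadOnly) signature named in the Tests section and assumes status arrives as a plain string, so unknown or future statuses simply hide the button.

```typescript
// Show the Stop button only for runs that can still be interrupted.
function shouldShowStopButton(status: string, isReadOnly: boolean): boolean {
  if (isReadOnly) {
    return false;
  }
  return status === "starting" || status === "running";
}
```

Keeping this outside the component means the read-only and terminal-state cases can be covered without rendering the route.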
4. Resume metadata for incomplete partial runs
Tighten the run-detail resumability contract so the UI does not infer resumability solely from execution_error rows.
Possible shape:
- Compute resumability from run completeness relative to the planned suite/test set recorded in benchmark.json / launch metadata.
- Surface explicit fields like is_resumable and resume_reason from the run-detail API.
- Continue to support the existing execution-error case, but also treat truncated partial runs as resumable.
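One way the shape above could be computed is sketched below. The is_resumable and resume_reason fields come from the proposal; everything else is an assumption for illustration: the test_id row key, the plannedTestIds input (standing in for whatever benchmark.json / launch metadata records), and the two reason strings.

```typescript
// Assumed row shape: only execution_status is taken from the proposal.
interface ResultRow {
  test_id: string; // hypothetical key identifying the test
  execution_status: string; // e.g. "ok" or "execution_error"
}

interface Resumability {
  is_resumable: boolean;
  resume_reason: "execution_error" | "incomplete" | null;
}

function computeResumability(
  plannedTestIds: readonly string[],
  rows: readonly ResultRow[],
): Resumability {
  // Existing case from #1220: at least one execution_error row.
  if (rows.some((row) => row.execution_status === "execution_error")) {
    return { is_resumable: true, resume_reason: "execution_error" };
  }
  // New case: some planned tests never produced a row.
  const recorded = new Set(rows.map((row) => row.test_id));
  const missing = plannedTestIds.filter((id) => !recorded.has(id));
  if (missing.length > 0) {
    return { is_resumable: true, resume_reason: "incomplete" };
  }
  return { is_resumable: false, resume_reason: null };
}
```

With this, a run that was killed after two ok rows out of three planned tests reports is_resumable: true with reason "incomplete", which is the case the repro above currently misses.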
5. Tests
- Server: in apps/cli/test/commands/results/serve.test.ts, add cases for unknown id (404), read-only (403), and a happy-path stop using a fake long-running child.
- CLI: a small test that sends SIGINT to a multi-test eval run and asserts (a) exit code is the "stopped" sentinel and (b) index.jsonl contains the rows for tests completed before the signal.
- UI: pure helper for "should the stop button render?": shouldShowStopButton(status, isReadOnly).
- Resume UI: tests covering both resumable states: execution_error rows and incomplete partial runs with only ok rows.
Acceptance signals
- […] index.jsonl containing all tests completed before the signal.
- DELETE /api/eval/run/:id exists and is 403-guarded in read-only mode; benchmark-scoped variant works the same.
- […] /jobs/:runId while running, hidden when terminal, hidden in read-only.
- […] /runs/:runId, shows the partial run and the Resume run button from #1220 visible when either condition is true:
  - […] execution_status: execution_error row, or
  - […] ok.
- […] main, killing terminal mid-eval is the only way to stop; green = on this branch, the Stop button on /jobs/<id> terminates cleanly and the partial run is resumable in one click.
Non-goals
- No "Pause" semantics. Stop fully terminates; resume is the way to continue.
- No queue management. This is for one running job at a time; multi-job orchestration is out of scope.
- No SIGINT-to-grader translation. If a grader is mid-flight when the signal arrives, let it finish or time out per existing rules.
Related
- […] packages/core/src/evaluation/providers/{claude-cli,codex-cli,pi-cli}.ts
Estimate
~half a day. CLI signal handling is the biggest unknown (need to thread the flag through the worker pool); the UI + API changes are small.