feat(studio): expose eval resumability — API + Resume action on run detail #1219

@christso

Objective

Surface the existing CLI resume mechanics (--resume, --rerun-failed, --output <dir>) in Studio so a user staring at an interrupted or partially-errored run can finish it from the web UI instead of dropping to a terminal.

Today Studio can launch a fresh eval (POST /api/eval/run) and renders execution_error per test on the run detail page, but the launch request shape doesn't carry the resume parameters and no UI affordance exists.

Follow-up to #1216 / PR #1217, which scoped TUI + flag-level UX + docs + auto-detect but explicitly deferred Studio.

Background — current state in code

  • Launch endpoint: apps/cli/src/commands/results/eval-runner.ts:240 (unscoped) and apps/cli/src/commands/results/eval-runner.ts:407 (benchmark-scoped). Both call buildCliArgs and spawn the CLI.
  • Request shape: RunEvalRequest at apps/cli/src/commands/results/eval-runner.ts:101 — has suite_filter, test_ids, target, threshold, workers, dry_run. Missing resume, rerun_failed, retry_errors, output.
  • CLI arg builder: buildCliArgs at apps/cli/src/commands/results/eval-runner.ts:110.
  • UI client: apps/studio/src/lib/api.ts:529 (runEval function).
  • Run detail route: apps/studio/src/routes/runs/$runId.tsx and benchmark variant apps/studio/src/routes/benchmarks/$benchmarkId_/runs/$runId.tsx.
  • Run detail component: apps/studio/src/components/RunDetail.tsx:174 already renders executionStatus === 'execution_error' per row, so the data needed to decide "is there anything to resume" is already on the page.
  • Job polling page: apps/studio/src/routes/jobs/$runId.tsx (existing — reuse for the post-resume status view).
  • Read-only guard: the launch endpoint already rejects in read-only mode (eval-runner.ts:241); the new behaviour must respect this.

Proposed changes

1. Extend the launch API (server)

Add to RunEvalRequest:

interface RunEvalRequest {
  // ...existing fields...
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string;  // path to a prior run dir or index.jsonl
  output?: string;        // explicit run dir; required when resume/rerun_failed are set
                          // and the server isn't auto-detecting from cache
}

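Wire format is snake_case per AGENTS.md ("Wire Format Convention"). Validation: mutually exclusive combinations of these fields are rejected with a 400 and a usable error message.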

Extend buildCliArgs to translate these into --resume, --rerun-failed, --retry-errors <path>, --output <dir>.
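
A minimal sketch of the server-side change, for discussion only. The field names and flags come from this issue. The helper names (validateResumeFields, resumeCliArgs) are hypothetical, and so is the rule that resume, rerun_failed, and retry_errors exclude one another; the real buildCliArgs may be structured differently. The sketch does not enforce output being present, since the server may auto-detect the run dir from cache.

```typescript
// Hypothetical subset of RunEvalRequest: only the new resume-related fields.
interface ResumeFields {
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string; // path to a prior run dir or index.jsonl
  output?: string;       // explicit run dir
}

// Returns an error message suitable for a 400 response, or null when the
// combination is acceptable. ASSUMPTION: the three modes exclude one another.
function validateResumeFields(req: ResumeFields): string | null {
  const modes = [
    req.resume ? "resume" : null,
    req.rerun_failed ? "rerun_failed" : null,
    req.retry_errors ? "retry_errors" : null,
  ].filter((m): m is string => m !== null);

  if (modes.length > 1) {
    return `${modes.join(" and ")} are mutually exclusive; set only one`;
  }
  return null;
}

// Translates the snake_case wire fields into the CLI flags named in this issue.
function resumeCliArgs(req: ResumeFields): string[] {
  const args: string[] = [];
  if (req.resume) args.push("--resume");
  if (req.rerun_failed) args.push("--rerun-failed");
  if (req.retry_errors) args.push("--retry-errors", req.retry_errors);
  if (req.output) args.push("--output", req.output);
  return args;
}
```

Under these assumptions, resumeCliArgs({ resume: true, output: "runs/abc" }) yields ["--resume", "--output", "runs/abc"], which would be appended to the arguments buildCliArgs already produces.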

2. Add UI action on the run detail page

On /runs/:runId (and the benchmark-scoped equivalent), when the loaded run contains at least one result with executionStatus === 'execution_error':

  • Render a primary button labelled "Resume run" that calls POST /api/eval/run with { suite_filter: <run's suite filter>, target: <run's target>, output: <run dir>, resume: true }.
  • Render a secondary button "Rerun failed cases" that does the same with rerun_failed: true instead of resume: true. (Same in-place semantics as the CLI flag — re-runs everything that wasn't executionStatus === 'ok'.)
  • After POST, redirect to /jobs/:runId (existing route) to show progress.
  • Disable both buttons in read-only mode.

UI placement: top-right of the RunDetail header is fine — keep it visible without scrolling.
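
The button logic above can be sketched as pure functions, kept separate from the component so they are easy to unit test. This is an illustration only: LoadedRun, its field names, and the helpers resumeActionsState and buildResumeBody are hypothetical. The request body keys (suite_filter, target, output, resume, rerun_failed) and the executionStatus values come from this issue. The type of suite_filter is assumed to be a string.

```typescript
// Hypothetical shapes: only the fields this sketch needs.
interface RunResultRow {
  executionStatus: string; // e.g. "ok" or "execution_error"
}

interface LoadedRun {
  runDir: string;
  suiteFilter: string; // ASSUMPTION: a single string filter
  target: string;
  results: RunResultRow[];
}

interface ResumeRequestBody {
  suite_filter: string;
  target: string;
  output: string;
  resume?: boolean;
  rerun_failed?: boolean;
}

type ResumeMode = "resume" | "rerun_failed";
type ActionState = "hidden" | "disabled" | "enabled";

// Buttons appear only when at least one row errored, and are disabled
// (not hidden) in read-only mode.
function resumeActionsState(run: LoadedRun, readOnly: boolean): ActionState {
  const hasErrors = run.results.some(
    (row) => row.executionStatus === "execution_error",
  );
  if (!hasErrors) return "hidden";
  return readOnly ? "disabled" : "enabled";
}

// Builds the POST /api/eval/run body for either button.
function buildResumeBody(run: LoadedRun, mode: ResumeMode): ResumeRequestBody {
  const base = {
    suite_filter: run.suiteFilter,
    target: run.target,
    output: run.runDir,
  };
  return mode === "resume"
    ? { ...base, resume: true }
    : { ...base, rerun_failed: true };
}
```

Whether read-only mode should hide or disable the buttons is left open by the acceptance signals; this sketch picks "disabled" so the affordance stays discoverable.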

3. Tests

  • Server tests in apps/cli/test/commands/results/serve.test.ts (existing file): add cases for valid resume/rerun_failed/retry_errors requests, mutual-exclusivity rejections, and the read-only guard.
  • UI tests: assert the button only renders when the run has at least one execution_error row; assert the request body shape on click; assert read-only hides/disables the buttons.

Acceptance signals

  • RunEvalRequest accepts resume, rerun_failed, retry_errors, output (snake_case keys).
  • CLI is spawned with the corresponding flags; verified by inspecting the command field returned in the launch response.
  • Mutual-exclusivity validation returns 400 with a usable error message.
  • /runs/:runId shows a "Resume run" button when any row has executionStatus === 'execution_error'; clicking it triggers a launch with resume: true + output: <runDir> and redirects to /jobs/:runId.
  • "Rerun failed cases" button works analogously with rerun_failed: true.
  • Read-only mode hides or disables both buttons (button-level UX, not just the 403 from the server).
  • Manual red/green UAT documented in the PR. Red: launch a deliberately failing eval, observe execution_error rows, and confirm there is no resume button on main. Green: same scenario, click Resume, and observe that the resumed run reuses the same run dir and skips the previously passing tests.

Non-goals

  • No /runs list filter for incomplete runs. Add the action where users already are (the detail page); broader filters can be a separate, smaller issue if usage warrants.
  • No new resume verbs. Surface the three existing CLI flags; don't invent a fourth.
  • No --retry-errors <path> UI picker. The path-based variant is for cross-run cases; in-Studio resume targets the run currently being viewed, so output: <currentRunDir> is sufficient.
  • No scheduled / auto-resume. Manual button click only.
  • No changes to the run-launch wizard / form for new runs — this issue is about resuming existing runs.

Estimate

~1 day. Server change is mechanical (one interface, one arg builder, validation, tests). UI change is one button + one route handler + tests. No design work needed — peers (promptfoo cloud) put resume actions on run detail pages too.

Status: In progress