Workflow: publish eval results by miguelg719 · Pull Request #2093 · browserbase/stagehand

miguelg719 · 2026-05-06T23:07:36Z

why

what changed

test plan

Summary by cubic

Adds a GitHub Actions workflow and a script to publish Braintrust benchmark results to Upstash/Vercel KV for the Evals UI. It fetches an experiment, groups rows by model and agent mode, computes accuracy/speed/cost, merges into the latest dataset, and can write a per‑experiment key.
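To illustrate the aggregation step described above, here is a minimal sketch of grouping rows by model and agent mode and computing accuracy, speed, and cost. The `CaseResult` shape and the `summarize` helper are hypothetical; the actual Braintrust row fields and the script's internals are not shown in this PR summary.

```typescript
// Hypothetical row shape (assumption): the real Braintrust fields may differ.
interface CaseResult {
  model: string;
  agentMode?: string;
  passed: boolean;
  durationMs: number;
  costUsd: number;
}

interface GroupMetrics {
  model: string;
  agentMode?: string;
  accuracy: number; // pass rate in [0, 1]
  avgDurationS: number; // mean task duration in seconds
  costPerTask: number;
  totalCost: number;
}

// Groups rows by (model, agentMode) and computes per-group metrics.
function summarize(rows: CaseResult[]): GroupMetrics[] {
  const groups = new Map<string, CaseResult[]>();
  for (const row of rows) {
    // JSON-encode the pair so the key is unambiguous for any model name.
    const key = JSON.stringify([row.model, row.agentMode ?? null]);
    const bucket = groups.get(key);
    if (bucket) bucket.push(row);
    else groups.set(key, [row]);
  }
  return Array.from(groups.values()).map((bucket) => {
    const totalCost = bucket.reduce((sum, r) => sum + r.costUsd, 0);
    const totalMs = bucket.reduce((sum, r) => sum + r.durationMs, 0);
    const passed = bucket.filter((r) => r.passed).length;
    return {
      model: bucket[0].model,
      agentMode: bucket[0].agentMode,
      accuracy: passed / bucket.length,
      avgDurationS: totalMs / bucket.length / 1000,
      costPerTask: totalCost / bucket.length,
      totalCost,
    };
  });
}
```

Groups come back in first-seen order, since `Map` preserves insertion order.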

New Features
- Adds Publish Evals workflow (.github/workflows/publish-evals.yml) with inputs (experiment, project, kv_key, experiment_key_prefix, write_experiment_key, dry_run). Posts a summary (experiment, project, keys, pass stats, dry run) and uploads evals-ui-data.json and publish-evals-ui-data-output.json.
- Introduces packages/evals/scripts/publish-braintrust-ui-data.ts run via pnpm --filter @browserbasehq/stagehand-evals exec tsx. Infers benchmark key/label and model/provider, computes pass rate, avg duration (s), cost per task, and total cost, and includes agentMode when present. Merges into <kv_key>, sorts rows by accuracy then speed, and optionally writes <experiment_key_prefix>:<experiment-id> (--no-experiment-key to skip). Supports --dry-run, --dataset-id, and writes the merged payload to disk (evals-ui-data.json). Defaults: project stagehand, dataset id stagehand-evals, latest key stagehand:evals:latest, experiment prefix stagehand:evals:experiments.

Migration
- Set secrets: BRAINTRUST_API_KEY and either UPSTASH_REDIS_REST_URL/UPSTASH_REDIS_REST_TOKEN or KV_REST_API_URL/KV_REST_API_TOKEN.
- Run “Publish Evals” with the Braintrust experiment name or UUID; override inputs as needed, enable dry_run or disable write_experiment_key if desired.
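The secret requirement above accepts either of two variable pairs. A minimal sketch of how that choice could be resolved follows; the `resolveKvCredentials` helper and the preference for the `UPSTASH_*` pair when both are set are assumptions, not taken from the script.

```typescript
interface KvCredentials {
  url: string;
  token: string;
}

// Returns the first complete URL/token pair found in `env`.
// Assumption: the UPSTASH_* pair is preferred over the KV_REST_API_* pair.
function resolveKvCredentials(
  env: Record<string, string | undefined>,
): KvCredentials {
  const pairs: Array<[string, string]> = [
    ["UPSTASH_REDIS_REST_URL", "UPSTASH_REDIS_REST_TOKEN"],
    ["KV_REST_API_URL", "KV_REST_API_TOKEN"],
  ];
  for (const [urlName, tokenName] of pairs) {
    const url = env[urlName];
    const token = env[tokenName];
    // Both halves of a pair must be present and non-empty.
    if (url && token) return { url, token };
  }
  throw new Error(
    "Set UPSTASH_REDIS_REST_URL/UPSTASH_REDIS_REST_TOKEN or KV_REST_API_URL/KV_REST_API_TOKEN",
  );
}
```

In a Node script this would be called as `resolveKvCredentials(process.env)`. Failing fast on a half-configured pair avoids a confusing error later at write time.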

Written for commit 900b504. Summary will update on new commits.

changeset-bot · 2026-05-06T23:07:40Z

⚠️ No Changeset found

Latest commit: 900b504

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai

1 issue found across 2 files

Confidence score: 4/5

This PR is likely safe to merge, with a modest logic risk: in packages/evals/scripts/publish-braintrust-ui-data.ts, benchCase.category is currently unreachable because benchCase.suite.replace(...) always yields a string.
The most significant impact is potential benchmark-source misclassification when dataset is missing, which is user-visible in published eval metadata but not likely to break core execution paths.
Given the reported severity (5/10) and a single focused issue, the risk appears contained rather than merge-blocking.
Pay close attention to packages/evals/scripts/publish-braintrust-ui-data.ts - unreachable fallback logic may misclassify source data when dataset is absent.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/publish-braintrust-ui-data.ts">

<violation number="1" location="packages/evals/scripts/publish-braintrust-ui-data.ts:291">
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</violation>
</file>

Architecture diagram

sequenceDiagram
    participant Dev as Developer (Manual Trigger)
    participant GHA as GitHub Actions Workflow
    participant Script as publish-braintrust-ui-data.ts
    participant Braintrust as Braintrust API
    participant Upstash as Upstash Redis
    participant Artifact as GitHub Artifacts

    Note over Dev,Artifact: Publish Evals Workflow

    Dev->>GHA: Trigger workflow_dispatch with inputs
    Note over Dev,GHA: experiment, project, kv_key, dry_run, etc.

    GHA->>GHA: Checkout code & setup node/pnpm

    GHA->>Script: Execute pnpm tsx script with parsed args
    Note over GHA,Script: Passes env vars (BRAINTRUST_API_KEY, UPSTASH_*)

    Script->>Braintrust: Fetch experiment data
    Note over Script,Braintrust: Uses experiment name/UUID & project

    Braintrust-->>Script: Return experiment results & bench cases

    Script->>Script: Infer benchmark key & label
    Note over Script: Extracts from dataset/suite/category

    Script->>Script: Infer model, provider, & provider key
    Note over Script: Parses model string for provider prefix

    Script->>Script: Compute pass rate, avg duration (speed), & cost
    Note over Script: Scans bench results for pass/fail, timing, & cost metrics

    Script->>Script: Build UiBenchmarkRow structure

    alt Not dry run
        Script->>Upstash: Read existing data from <kv_key>
        Upstash-->>Script: Return existing payload (or empty)

        Script->>Script: Merge new row into existing dataset
        Note over Script: Dedup by model+provider, sort by accuracy then speed

        Script->>Upstash: SET <kv_key> with merged payload
        Script->>Upstash: OPTIONAL: SET <experiment_key_prefix>:<experiment_id>
        Upstash-->>Script: Confirm write

        Script->>Script: Build output JSON with summary & keys written
    else Dry run
        Script->>Script: Build output JSON without Upstash calls
    end

    Script-->>GHA: Write evals-ui-data.json & publish-evals-ui-data-output.json

    GHA->>GHA: Parse output JSON and extract summary

    GHA->>GHA: Add publish summary to GitHub Step Summary

    GHA->>Artifact: Upload generated payload files
    Note over GHA,Artifact: Always uploads, even on failure
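
The merge step in the diagram (dedup by model+provider, sort by accuracy then speed) could look roughly like the sketch below. The `UiRow` fields and the `mergeRows` helper are hypothetical, and the sort directions (higher accuracy first, then lower average duration first) are an assumption; the real dedup key may also include fields such as the agent mode.

```typescript
// Hypothetical subset of the UI row (assumption): only the fields needed here.
interface UiRow {
  model: string;
  provider: string;
  accuracy: number; // pass rate in [0, 1]
  avgDurationS: number; // mean task duration in seconds
}

// Merges incoming rows into existing ones; an incoming row replaces an
// existing row that has the same provider and model.
function mergeRows(existing: UiRow[], incoming: UiRow[]): UiRow[] {
  const byKey = new Map<string, UiRow>();
  for (const row of existing.concat(incoming)) {
    // Later entries overwrite earlier ones, so incoming rows win.
    byKey.set(JSON.stringify([row.provider, row.model]), row);
  }
  // Sort by accuracy descending, breaking ties by duration ascending.
  return Array.from(byKey.values()).sort(
    (a, b) => b.accuracy - a.accuracy || a.avgDurationS - b.avgDurationS,
  );
}
```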


cubic-dev-ai · 2026-05-06T23:11:57Z

+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+  return (
+    benchCase.dataset ??
+    benchCase.suite.replace(/^agent\//, "") ??


P2: benchCase.category is unreachable here because benchCase.suite.replace(...) always returns a string, so this fallback never executes. This can misclassify benchmark source when dataset is missing.
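
One possible fix, sketched under assumptions: `String.prototype.replace` never returns `null` or `undefined`, so `??` can never fall through to the next operand. The `BenchCaseRow` shape below is hypothetical (the real type is not shown in this thread), and switching to `||` is a deliberate behavior change that also treats empty strings as missing.

```typescript
// Hypothetical shape (assumption): the real BenchCaseRow is not shown here.
interface BenchCaseRow {
  dataset?: string;
  suite?: string;
  category?: string;
}

function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  // Optional chaining yields undefined when suite is absent.
  const suite = benchCase.suite?.replace(/^agent\//, "");
  // `||` skips empty strings as well as undefined, so an empty dataset or
  // suite falls through to the next candidate and category stays reachable.
  return benchCase.dataset || suite || benchCase.category;
}
```

If `suite` is always a non-empty string in practice, `category` would still never be used, and dropping the fallback may be the cleaner change.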

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At packages/evals/scripts/publish-braintrust-ui-data.ts, line 291:

<comment>`benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</comment>

<file context>
@@ -0,0 +1,744 @@
+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+  return (
+    benchCase.dataset ??
+    benchCase.suite.replace(/^agent\//, "") ??
+    benchCase.category
+  );
</file context>

Publish eval results

5fcdb16

miguelg719 marked this pull request as ready for review May 6, 2026 23:07

cubic-dev-ai Bot reviewed May 6, 2026

miguelg719 added 2 commits May 7, 2026 15:23

linting

d06f816

update

900b504

miguelg719 wants to merge 3 commits into main from miguelgonzalez/stg-1926-ci-evals-page-integration-2
