Workflow: publish eval results #2093

Open
miguelg719 wants to merge 3 commits into main from
miguelgonzalez/stg-1926-ci-evals-page-integration-2
Conversation

@miguelg719
Collaborator

@miguelg719 miguelg719 commented May 6, 2026

why

what changed

test plan


Summary by cubic

Adds a GitHub Actions workflow and a script to publish Braintrust benchmark results to Upstash/Vercel KV for the Evals UI. It fetches an experiment, groups rows by model and agent mode, computes accuracy/speed/cost, merges into the latest dataset, and can write a per‑experiment key.
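For illustration, the grouping and metric step described above might look roughly like the sketch below. The `CaseResult` shape, its field names, and the millisecond duration unit are assumptions made for this example, not the types used by the script in this PR.

```typescript
// Hypothetical per-case result; the real row type in the PR may differ.
interface CaseResult {
  model: string;
  agentMode?: string;
  passed: boolean;
  durationMs: number; // assumed unit for this sketch
  costUsd: number;
}

interface GroupMetrics {
  model: string;
  agentMode?: string;
  passRate: number; // fraction of passing cases, 0..1
  avgDurationSeconds: number;
  costPerTask: number;
  totalCost: number;
}

// Group results by model and agent mode, then compute accuracy, speed, and cost.
function summarize(results: CaseResult[]): GroupMetrics[] {
  const groups = new Map<
    string,
    { model: string; agentMode?: string; n: number; passed: number; totalMs: number; totalCost: number }
  >();
  for (const r of results) {
    const key = `${r.model}::${r.agentMode ?? ""}`;
    const acc =
      groups.get(key) ??
      { model: r.model, agentMode: r.agentMode, n: 0, passed: 0, totalMs: 0, totalCost: 0 };
    acc.n += 1;
    acc.passed += r.passed ? 1 : 0;
    acc.totalMs += r.durationMs;
    acc.totalCost += r.costUsd;
    groups.set(key, acc);
  }
  return Array.from(groups.values()).map((g) => ({
    model: g.model,
    agentMode: g.agentMode,
    passRate: g.passed / g.n,
    avgDurationSeconds: g.totalMs / g.n / 1000,
    costPerTask: g.totalCost / g.n,
    totalCost: g.totalCost,
  }));
}
```

Every group holds at least one result by construction, so the divisions by `g.n` cannot divide by zero.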

  • New Features

    • Adds Publish Evals workflow (.github/workflows/publish-evals.yml) with inputs (experiment, project, kv_key, experiment_key_prefix, write_experiment_key, dry_run). Posts a summary (experiment, project, keys, pass stats, dry run) and uploads evals-ui-data.json and publish-evals-ui-data-output.json.
    • Introduces packages/evals/scripts/publish-braintrust-ui-data.ts run via pnpm --filter @browserbasehq/stagehand-evals exec tsx. Infers benchmark key/label and model/provider, computes pass rate, avg duration (s), cost per task, and total cost, and includes agentMode when present. Merges into <kv_key>, sorts rows by accuracy then speed, and optionally writes <experiment_key_prefix>:<experiment-id> (--no-experiment-key to skip). Supports --dry-run, --dataset-id, and writes the merged payload to disk (evals-ui-data.json). Defaults: project stagehand, dataset id stagehand-evals, latest key stagehand:evals:latest, experiment prefix stagehand:evals:experiments.
  • Migration

    • Set secrets: BRAINTRUST_API_KEY and either UPSTASH_REDIS_REST_URL/UPSTASH_REDIS_REST_TOKEN or KV_REST_API_URL/KV_REST_API_TOKEN.
    • Run “Publish Evals” with the Braintrust experiment name or UUID; override inputs as needed, enable dry_run or disable write_experiment_key if desired.
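As a rough sketch of how the two secret pairs listed above could be resolved, assuming the Upstash names take precedence over the Vercel KV names (the PR does not state the order) and that `resolveKvCredentials` is a hypothetical helper name:

```typescript
interface KvCredentials {
  url: string;
  token: string;
}

// Pick the first complete URL/token pair. Truthiness checks are used on purpose:
// an unset secret typically reaches the job as an empty string, not undefined.
function resolveKvCredentials(env: Record<string, string | undefined>): KvCredentials {
  const pairs = [
    ["UPSTASH_REDIS_REST_URL", "UPSTASH_REDIS_REST_TOKEN"],
    ["KV_REST_API_URL", "KV_REST_API_TOKEN"],
  ] as const;
  for (const [urlName, tokenName] of pairs) {
    const url = env[urlName];
    const token = env[tokenName];
    if (url && token) {
      return { url, token };
    }
  }
  throw new Error(
    "Set UPSTASH_REDIS_REST_URL/UPSTASH_REDIS_REST_TOKEN or KV_REST_API_URL/KV_REST_API_TOKEN",
  );
}
```

Resolving the pair as a unit avoids mixing a URL from one provider with a token from the other.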

Written for commit 900b504. Summary will update on new commits.

@changeset-bot

changeset-bot Bot commented May 6, 2026

⚠️ No Changeset found

Latest commit: 900b504

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@miguelg719 miguelg719 marked this pull request as ready for review May 6, 2026 23:07
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files

Confidence score: 4/5

  • This PR is likely safe to merge, with a modest logic risk: in packages/evals/scripts/publish-braintrust-ui-data.ts, benchCase.category is currently unreachable because benchCase.suite.replace(...) always yields a string.
  • The most significant impact is potential benchmark-source misclassification when dataset is missing, which is user-visible in published eval metadata but not likely to break core execution paths.
  • Given the reported severity (5/10) and a single focused issue, the risk appears contained rather than merge-blocking.
  • Pay close attention to packages/evals/scripts/publish-braintrust-ui-data.ts - unreachable fallback logic may misclassify source data when dataset is absent.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/publish-braintrust-ui-data.ts">

<violation number="1" location="packages/evals/scripts/publish-braintrust-ui-data.ts:291">
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Dev as Developer (Manual Trigger)
    participant GHA as GitHub Actions Workflow
    participant Script as publish-braintrust-ui-data.ts
    participant Braintrust as Braintrust API
    participant Upstash as Upstash Redis
    participant Artifact as GitHub Artifacts

    Note over Dev,Artifact: Publish Evals Workflow

    Dev->>GHA: Trigger workflow_dispatch with inputs
    Note over Dev,GHA: experiment, project, kv_key, dry_run, etc.

    GHA->>GHA: Checkout code & setup node/pnpm

    GHA->>Script: Execute pnpm tsx script with parsed args
    Note over GHA,Script: Passes env vars (BRAINTRUST_API_KEY, UPSTASH_*)

    Script->>Braintrust: Fetch experiment data
    Note over Script,Braintrust: Uses experiment name/UUID & project

    Braintrust-->>Script: Return experiment results & bench cases

    Script->>Script: Infer benchmark key & label
    Note over Script: Extracts from dataset/suite/category

    Script->>Script: Infer model, provider, & provider key
    Note over Script: Parses model string for provider prefix

    Script->>Script: Compute pass rate, avg duration (speed), & cost
    Note over Script: Scans bench results for pass/fail, timing, & cost metrics

    Script->>Script: Build UiBenchmarkRow structure

    alt Not dry run
        Script->>Upstash: Read existing data from <kv_key>
        Upstash-->>Script: Return existing payload (or empty)

        Script->>Script: Merge new row into existing dataset
        Note over Script: Dedup by model+provider, sort by accuracy then speed

        Script->>Upstash: SET <kv_key> with merged payload
        Script->>Upstash: OPTIONAL: SET <experiment_key_prefix>:<experiment_id>
        Upstash-->>Script: Confirm write

        Script->>Script: Build output JSON with summary & keys written
    else Dry run
        Script->>Script: Build output JSON without Upstash calls
    end

    Script-->>GHA: Write evals-ui-data.json & publish-evals-ui-data-output.json

    GHA->>GHA: Parse output JSON and extract summary

    GHA->>GHA: Add publish summary to GitHub Step Summary

    GHA->>Artifact: Upload generated payload files
    Note over GHA,Artifact: Always uploads, even on failure
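
The merge step in the diagram (dedup by model+provider, sort by accuracy then speed) could be sketched as follows. `UiRow` is a simplified stand-in for the `UiBenchmarkRow` structure, and treating a lower average duration as "faster" is an assumption of this example.

```typescript
// Simplified, hypothetical row; the real UiBenchmarkRow has more fields.
interface UiRow {
  model: string;
  provider: string;
  accuracy: number; // pass rate, higher is better
  avgDurationSeconds: number; // assumed: lower is better
}

// Merge incoming rows into the existing dataset. A later row replaces an
// earlier one with the same provider+model key, so a re-published experiment
// overwrites its previous entry instead of duplicating it.
function mergeRows(existing: UiRow[], incoming: UiRow[]): UiRow[] {
  const byKey = new Map<string, UiRow>();
  for (const row of existing.concat(incoming)) {
    byKey.set(`${row.provider}/${row.model}`, row);
  }
  return Array.from(byKey.values()).sort(
    (a, b) => b.accuracy - a.accuracy || a.avgDurationSeconds - b.avgDurationSeconds,
  );
}
```

If rows for the same model are meant to differ by `agentMode`, the key would need to include it as well; the diagram only mentions model and provider.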

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  return (
    benchCase.dataset ??
    benchCase.suite.replace(/^agent\//, "") ??
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 6, 2026

P2: benchCase.category is unreachable here because benchCase.suite.replace(...) always returns a string, so this fallback never executes. This can misclassify benchmark source when dataset is missing.
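
One possible fix, sketched under the assumption that `BenchCaseRow` has an optional `dataset`, a required `suite`, and an optional `category` (the interface below is a hypothetical stand-in, not the PR's type): use `||` so that empty strings fall through to the next candidate.

```typescript
// Hypothetical stand-in for the PR's BenchCaseRow type.
interface BenchCaseRow {
  dataset?: string;
  suite: string;
  category?: string;
}

function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  // replace() always returns a string, so `??` never falls past it.
  // `||` treats an empty result (e.g. a suite of "agent/") as missing.
  const suite = benchCase.suite.replace(/^agent\//, "");
  return benchCase.dataset || suite || benchCase.category;
}
```

With this version `category` is reached only when both `dataset` and the stripped `suite` are empty. Whether `category` should instead take precedence over `suite` is a decision for the author.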

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/scripts/publish-braintrust-ui-data.ts, line 291:

<comment>`benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</comment>

<file context>
@@ -0,0 +1,744 @@
+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+  return (
+    benchCase.dataset ??
+    benchCase.suite.replace(/^agent\//, "") ??
+    benchCase.category
+  );
</file context>
