Workflow: publish eval results #2093
Conversation
1 issue found across 2 files

Confidence score: 4/5

- This PR is likely safe to merge, with a modest logic risk: in `packages/evals/scripts/publish-braintrust-ui-data.ts`, `benchCase.category` is currently unreachable because `benchCase.suite.replace(...)` always yields a string.
- The most significant impact is potential benchmark-source misclassification when `dataset` is missing, which is user-visible in published eval metadata but not likely to break core execution paths.
- Given the reported severity (5/10) and a single focused issue, the risk appears contained rather than merge-blocking.
- Pay close attention to `packages/evals/scripts/publish-braintrust-ui-data.ts`: unreachable fallback logic may misclassify source data when `dataset` is absent.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/scripts/publish-braintrust-ui-data.ts">
<violation number="1" location="packages/evals/scripts/publish-braintrust-ui-data.ts:291">
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</violation>
</file>
Architecture diagram
```mermaid
sequenceDiagram
    participant Dev as Developer (Manual Trigger)
    participant GHA as GitHub Actions Workflow
    participant Script as publish-braintrust-ui-data.ts
    participant Braintrust as Braintrust API
    participant Upstash as Upstash Redis
    participant Artifact as GitHub Artifacts
    Note over Dev,Artifact: Publish Evals Workflow
    Dev->>GHA: Trigger workflow_dispatch with inputs
    Note over Dev,GHA: experiment, project, kv_key, dry_run, etc.
    GHA->>GHA: Checkout code & setup node/pnpm
    GHA->>Script: Execute pnpm tsx script with parsed args
    Note over GHA,Script: Passes env vars (BRAINTRUST_API_KEY, UPSTASH_*)
    Script->>Braintrust: Fetch experiment data
    Note over Script,Braintrust: Uses experiment name/UUID & project
    Braintrust-->>Script: Return experiment results & bench cases
    Script->>Script: Infer benchmark key & label
    Note over Script: Extracts from dataset/suite/category
    Script->>Script: Infer model, provider, & provider key
    Note over Script: Parses model string for provider prefix
    Script->>Script: Compute pass rate, avg duration (speed), & cost
    Note over Script: Scans bench results for pass/fail, timing, & cost metrics
    Script->>Script: Build UiBenchmarkRow structure
    alt Not dry run
        Script->>Upstash: Read existing data from <kv_key>
        Upstash-->>Script: Return existing payload (or empty)
        Script->>Script: Merge new row into existing dataset
        Note over Script: Dedup by model+provider, sort by accuracy then speed
        Script->>Upstash: SET <kv_key> with merged payload
        Script->>Upstash: OPTIONAL: SET <experiment_key_prefix>:<experiment_id>
        Upstash-->>Script: Confirm write
        Script->>Script: Build output JSON with summary & keys written
    else Dry run
        Script->>Script: Build output JSON without Upstash calls
    end
    Script-->>GHA: Write evals-ui-data.json & publish-evals-ui-data-output.json
    GHA->>GHA: Parse output JSON and extract summary
    GHA->>GHA: Add publish summary to GitHub Step Summary
    GHA->>Artifact: Upload generated payload files
    Note over GHA,Artifact: Always uploads, even on failure
```
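The "merge new row" step in the diagram (dedup by model+provider, sort by accuracy then speed) could be sketched as follows. The `UiBenchmarkRow` field names here are illustrative assumptions, not the script's actual shape:

```typescript
// Hypothetical row shape; the real UiBenchmarkRow in the script may differ.
interface UiBenchmarkRow {
  model: string;
  provider: string;
  accuracy: number;     // pass rate in [0, 1], higher is better
  avgDurationS: number; // average task duration in seconds, lower is better
}

function mergeRows(
  existing: UiBenchmarkRow[],
  incoming: UiBenchmarkRow,
): UiBenchmarkRow[] {
  const key = (r: UiBenchmarkRow) => `${r.model}::${r.provider}`;
  // Drop any prior row for the same model+provider, then append the new one.
  const merged = existing.filter((r) => key(r) !== key(incoming));
  merged.push(incoming);
  // Sort by accuracy descending, breaking ties by speed (shorter duration first).
  merged.sort((a, b) => b.accuracy - a.accuracy || a.avgDurationS - b.avgDurationS);
  return merged;
}
```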
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
```ts
function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  return (
    benchCase.dataset ??
    benchCase.suite.replace(/^agent\//, "") ??
    benchCase.category
  );
}
```
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/scripts/publish-braintrust-ui-data.ts, line 291:
<comment>`benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</comment>
<file context>
@@ -0,0 +1,744 @@
+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+ return (
+ benchCase.dataset ??
+ benchCase.suite.replace(/^agent\//, "") ??
+ benchCase.category
+ );
</file context>
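The root cause is that `??` only falls through on `null` or `undefined`, and `String.prototype.replace` always returns a string, so the `benchCase.category` branch is dead code. A minimal sketch of one possible fix, assuming the intent is to fall back to `category` when `suite` is missing or empty after stripping the prefix (the `BenchCaseRow` shape here is a simplified assumption):

```typescript
// Simplified stand-in for the script's BenchCaseRow type.
type BenchCaseRow = {
  dataset?: string;
  suite?: string;
  category?: string;
};

function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  if (benchCase.dataset) return benchCase.dataset;
  // Only use suite when it is present and non-empty after stripping the prefix;
  // with ?? alone, replace() always returns a string and category is unreachable.
  const suite = benchCase.suite?.replace(/^agent\//, "");
  if (suite) return suite;
  return benchCase.category;
}
```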
Summary by cubic
Adds a GitHub Actions workflow and a script to publish Braintrust benchmark results to Upstash/Vercel KV for the Evals UI. It fetches an experiment, groups rows by model and agent mode, computes accuracy/speed/cost, merges into the latest dataset, and can write a per‑experiment key.
New Features

- `Publish Evals` workflow (`.github/workflows/publish-evals.yml`) with inputs (`experiment`, `project`, `kv_key`, `experiment_key_prefix`, `write_experiment_key`, `dry_run`). Posts a summary (experiment, project, keys, pass stats, dry run) and uploads `evals-ui-data.json` and `publish-evals-ui-data-output.json`.
- `packages/evals/scripts/publish-braintrust-ui-data.ts`, run via `pnpm --filter @browserbasehq/stagehand-evals exec tsx`. Infers benchmark key/label and model/provider; computes pass rate, avg duration (s), cost per task, and total cost; and includes `agentMode` when present. Merges into `<kv_key>`, sorts rows by accuracy then speed, and optionally writes `<experiment_key_prefix>:<experiment-id>` (`--no-experiment-key` to skip). Supports `--dry-run` and `--dataset-id`, and writes the merged payload to disk (`evals-ui-data.json`). Defaults: project `stagehand`, dataset id `stagehand-evals`, latest key `stagehand:evals:latest`, experiment prefix `stagehand:evals:experiments`.

Migration

- Requires `BRAINTRUST_API_KEY` and either `UPSTASH_REDIS_REST_URL`/`UPSTASH_REDIS_REST_TOKEN` or `KV_REST_API_URL`/`KV_REST_API_TOKEN`.
- Use `dry_run` or disable `write_experiment_key` if desired.

Written for commit 900b504. Summary will update on new commits.
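The model/provider inference mentioned above ("parses model string for provider prefix") might look like this sketch. The `provider/model` delimiter convention is an assumption inferred from the diagram note, not the script's confirmed format:

```typescript
// Split a model identifier such as "openai/gpt-4o" into provider and model name.
// Assumes a "provider/model" convention; a bare name yields no provider.
function inferProvider(model: string): { provider?: string; modelName: string } {
  const idx = model.indexOf("/");
  if (idx === -1) return { modelName: model };
  return { provider: model.slice(0, idx), modelName: model.slice(idx + 1) };
}
```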