feat(0.31.0): JudgeScoresRecord on RunRecord.outcome #66
Merged
Ensemble-judge consumers were dropping per-judge per-dim scores on the floor because RunOutcome only had a slot for the composite. Adds a typed `judgeScores?: JudgeScoresRecord` field, threaded through runEvalCampaign and pinned in the consumer-contract test. Validator rejects NaN scores and non-string failedJudges entries; fail-loud test covers a panel where one judge throws. Bumps TS + Python clients to 0.31.0 in lockstep.
The gap this closes
agent-builder PR #179 just landed a forge-chat judge rubric that
computes three per-dim scores (helpfulness, clarity, on_topic) per
cell, but only persists the composite to `records.jsonl` because
`RunRecord.outcome` doesn't have a per-judge / per-dim slot. The same
problem hits every consumer that wants ensemble scoring across the
five product agents (tax / creative / legal / gtm / agent-builder).

Consumers were either dropping the breakdown on the floor or
smuggling it through stringly-typed `outcome.raw` keys like
`judge_kimi_helpfulness`; neither survives a corpus-IRR run.
`corpusInterRaterAgreement` (0.27.2) expects structured per-judge
per-dim records, not parsed strings.
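As an illustration of why parsed string keys are fragile, here is a minimal sketch of one possible failure mode. `parseRawKey` and the judge id `gpt_4` are hypothetical and exist only for this example; only the `judge_<judgeId>_<dim>` key shape and the `on_topic` dimension come from the description above.

```typescript
// Hypothetical naive parser for flat keys shaped like judge_<judgeId>_<dim>.
// It assumes the judge id is the single segment after the "judge_" prefix.
function parseRawKey(key: string): { judgeId: string; dim: string } | null {
  const prefix = "judge_";
  if (!key.startsWith(prefix)) return null;
  const rest = key.slice(prefix.length);
  const sep = rest.indexOf("_");
  if (sep <= 0 || sep === rest.length - 1) return null;
  return { judgeId: rest.slice(0, sep), dim: rest.slice(sep + 1) };
}

// Works when only the dimension contains an underscore...
console.log(parseRawKey("judge_kimi_on_topic"));

// ...but misattributes the score once a judge id contains one too:
// the id "gpt_4" is split into judge "gpt" and dimension "4_clarity".
console.log(parseRawKey("judge_gpt_4_clarity"));

// The structured form needs no parsing: ids and dims are plain keys.
const perJudge: Record<string, Record<string, number>> = {
  gpt_4: { clarity: 0.5 }, // illustrative value
};
console.log(perJudge["gpt_4"]);
```

Nested keys remove the need for a delimiter convention entirely, so no judge id or dimension name can collide with it.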
What this ships
- `JudgeScoresRecord` type (`src/run-record.ts`):
  - `perJudge[judgeId][dim]: number` - canonical store
  - `perDimMean[dim]: number` - convenience projection
  - `composite: number` - mirrors the score the gate sees
  - `failedJudges?: string[]` - explicit dead-judge ids
  - `notes?: string` - panel prose
- `RunOutcome.judgeScores?: JudgeScoresRecord` - additive on the
  outcome; existing single-judge runs leave it unset.
- `CampaignRunOutcome.judgeScores?` wired through `runEvalCampaign`
  so per-cell ensemble outcomes land on
  `RunRecord.outcome.judgeScores` unchanged.
- `validateRunRecord`: per-judge / per-dim / composite scores must be
  finite (no silent NaN-as-zero); `failedJudges` entries must be
  non-empty strings.
- `tests/run-record.test.ts` and `tests/eval-campaign.test.ts` cover
  all four shapes (full, partial with `failedJudges`, missing, with
  notes) plus a fail-loud case where one judge throws and the record
  carries the dead-judge id, not a silent zero.
- The consumer-contract test (`tests/consumer-contract.test.ts`) pins
  `JudgeScoresRecord` as a type-level export so consumer code stops
  compiling if the field gets renamed.
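The shape and validation rules listed above can be sketched roughly as follows. The interface is reconstructed from the field list in this description, not copied from `src/run-record.ts`, and `checkJudgeScores` is a hypothetical stand-in for the relevant part of `validateRunRecord`, whose real signature is not shown here. All score values are illustrative.

```typescript
// Reconstructed from the field list above; the real type may differ in detail.
interface JudgeScoresRecord {
  perJudge: Record<string, Record<string, number>>; // perJudge[judgeId][dim]
  perDimMean: Record<string, number>;               // perDimMean[dim]
  composite: number;
  failedJudges?: string[];
  notes?: string;
}

// Hypothetical helper: collects every violation instead of stopping at the first.
function checkJudgeScores(rec: JudgeScoresRecord): string[] {
  const errors: string[] = [];
  for (const [judgeId, dims] of Object.entries(rec.perJudge)) {
    for (const [dim, score] of Object.entries(dims)) {
      // Number.isFinite rejects NaN and +/-Infinity.
      if (!Number.isFinite(score)) errors.push(`perJudge.${judgeId}.${dim} is not finite`);
    }
  }
  for (const [dim, mean] of Object.entries(rec.perDimMean)) {
    if (!Number.isFinite(mean)) errors.push(`perDimMean.${dim} is not finite`);
  }
  if (!Number.isFinite(rec.composite)) errors.push("composite is not finite");
  for (const id of rec.failedJudges ?? []) {
    // Records read back from JSONL are untyped at runtime, so check the type too.
    if (typeof id !== "string" || id.length === 0) {
      errors.push("failedJudges entries must be non-empty strings");
    }
  }
  return errors;
}

// A "partial" record: one judge scored, one (hypothetical id) is listed as failed.
const partial: JudgeScoresRecord = {
  perJudge: { kimi: { helpfulness: 0.8, clarity: 0.6, on_topic: 1 } },
  perDimMean: { helpfulness: 0.8, clarity: 0.6, on_topic: 1 },
  composite: 0.8,
  failedJudges: ["judge-b"],
};
console.log(checkJudgeScores(partial));
console.log(checkJudgeScores({ ...partial, composite: NaN }));
```

Returning a list of problems is a choice made for the sketch; a validator that throws on the first violation would satisfy the same rules.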
Design tradeoffs
- `perDimMean` and `composite` are precomputed projections of
  `perJudge`. Storing both costs a few bytes per record but spares
  every reporter and IRR primitive a re-aggregation; the trade is on
  the right side for the read-heavy access pattern.
- `failedJudges?: string[]` is the typed-outcome answer to partial
  failures. Missing keys in `perJudge` would be ambiguous (silent
  zero vs not run); the explicit list is fail-loud.
- The field is additive on `RunOutcome` so the 0.30.0 surface is
  preserved. Scalar-only runs leave it unset.
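To make the tradeoffs above concrete, here is a minimal sketch of how a panel runner might build the record when one judge throws. `scorePanel`, the synchronous `Judge` shape, and the judge id `flaky` are hypothetical. The aggregation rule (unweighted mean per dimension, then unweighted mean of those means for the composite) is an assumption; this description does not say how the real composite is computed.

```typescript
interface JudgeScoresRecord {
  perJudge: Record<string, Record<string, number>>;
  perDimMean: Record<string, number>;
  composite: number;
  failedJudges?: string[];
  notes?: string;
}

// Hypothetical judge shape; real judges are presumably async model calls.
type Judge = { id: string; score: (output: string) => Record<string, number> };

function scorePanel(judges: Judge[], output: string): JudgeScoresRecord {
  const perJudge: Record<string, Record<string, number>> = {};
  const failedJudges: string[] = [];
  for (const judge of judges) {
    try {
      perJudge[judge.id] = judge.score(output);
    } catch {
      // Record the dead judge by id rather than writing zeros for it.
      failedJudges.push(judge.id);
    }
  }

  // Accumulate per-dimension totals over the judges that returned scores.
  const sums: Record<string, { total: number; n: number }> = {};
  for (const dims of Object.values(perJudge)) {
    for (const [dim, s] of Object.entries(dims)) {
      const acc = sums[dim] ?? { total: 0, n: 0 };
      acc.total += s;
      acc.n += 1;
      sums[dim] = acc;
    }
  }
  const perDimMean: Record<string, number> = {};
  for (const [dim, { total, n }] of Object.entries(sums)) {
    perDimMean[dim] = total / n;
  }

  const means = Object.values(perDimMean);
  if (means.length === 0) {
    // Fail loud instead of emitting a NaN composite.
    throw new Error("no judge returned scores");
  }
  // Assumed rule: composite is the unweighted mean of the per-dim means.
  const composite = means.reduce((a, b) => a + b, 0) / means.length;

  const record: JudgeScoresRecord = { perJudge, perDimMean, composite };
  if (failedJudges.length > 0) record.failedJudges = failedJudges;
  return record;
}

const panel: Judge[] = [
  { id: "kimi", score: () => ({ helpfulness: 1, clarity: 0.5 }) },
  { id: "flaky", score: () => { throw new Error("timeout"); } },
];
console.log(scorePanel(panel, "some agent output"));
```

A reader of the resulting record can tell that `flaky` did not run, because its id is in `failedJudges` and absent from `perJudge`, rather than inferring it from a missing key.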
What consumers gain
- `corpusInterRaterAgreement` consumes the records directly.
- Consumers get the per-judge / per-dim breakdown without each
  implementing their own conventions on `outcome.raw`.
Version bumps (lockstep)
- `package.json`: 0.30.0 → 0.31.0
- `clients/python/pyproject.toml`: 0.30.0 → 0.31.0
- `clients/python/src/agent_eval_rpc/__init__.py`: 0.30.0 → 0.31.0
Test plan
- `pnpm typecheck` clean
- `pnpm test`: 1208 tests pass (5 new in eval-campaign, 5 new in run-record, 1 new in consumer-contract)
- `pnpm build` clean (tsup + openapi)
- … `v0.31.0` and pushes to fire the publish workflow.