feat(0.31.0): JudgeScoresRecord on RunRecord.outcome by tangletools · Pull Request #66 · tangle-network/agent-eval

tangletools · 2026-05-20T14:44:15Z

The gap this closes

agent-builder PR #179 just landed a forge-chat judge rubric that
computes three per-dim scores (helpfulness, clarity, on_topic) per
cell, but only persists the composite to records.jsonl because
RunRecord.outcome doesn't have a per-judge / per-dim slot. The same
problem hits every consumer that wants ensemble scoring across the
five product agents (tax / creative / legal / gtm / agent-builder).

Consumers were either dropping the breakdown on the floor or
smuggling it through stringly-typed outcome.raw keys like
judge_kimi_helpfulness — neither survives a corpus-IRR run.
corpusInterRaterAgreement (0.27.2) expects structured per-judge
per-dim records, not parsed strings.
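To illustrate why the stringly-typed route is fragile, here is a minimal sketch. It assumes raw keys follow a `judge_<judgeId>_<dim>` pattern like the one quoted above; `parseRawKey` and the judge id `kimi_k2` are hypothetical and not part of agent-eval.

```typescript
// Hypothetical parser for outcome.raw keys shaped like judge_<judgeId>_<dim>.
// Not part of agent-eval; it only shows where string parsing breaks down.
function parseRawKey(key: string): { judgeId: string; dim: string } | null {
  // Treat everything up to the first underscore after "judge_" as the judge id.
  const m = /^judge_([^_]+)_(.+)$/.exec(key);
  return m ? { judgeId: m[1], dim: m[2] } : null;
}

// Works while judge ids contain no underscore, even if the dim does.
const fine = parseRawKey("judge_kimi_on_topic");

// Breaks as soon as a judge id contains one (kimi_k2 is a made-up id):
// the parser cannot tell where the id ends and the dim begins.
const ambiguous = parseRawKey("judge_kimi_k2_on_topic");

console.log(fine, ambiguous);
```

With a structured slot the same lookup is `perJudge["kimi_k2"]["on_topic"]`, so no delimiter convention has to be agreed on or reverse-engineered.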

What this ships

- JudgeScoresRecord type (src/run-record.ts):
  - perJudge[judgeId][dim]: number (canonical store)
  - perDimMean[dim]: number (convenience projection)
  - composite: number (mirrors the score the gate sees)
  - failedJudges?: string[] (explicit dead-judge ids)
  - notes?: string (panel prose)
- RunOutcome.judgeScores?: JudgeScoresRecord is additive on the
  outcome; existing single-judge runs leave it unset.
- CampaignRunOutcome.judgeScores? is wired through runEvalCampaign
  so per-cell ensemble outcomes land on RunRecord.outcome.judgeScores
  unchanged.
- Validator extended in validateRunRecord: per-judge, per-dim, and
  composite scores must be finite (no silent NaN-as-zero);
  failedJudges entries must be non-empty strings.
- Tests in tests/run-record.test.ts and tests/eval-campaign.test.ts
  cover all four shapes (full, partial with failedJudges, missing,
  with notes) plus a fail-loud case where one judge throws and the
  record carries the dead-judge id, not a silent zero.
- Consumer contract (tests/consumer-contract.test.ts) pins
  JudgeScoresRecord as a type-level export so consumer code stops
  compiling if the field gets renamed.
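As a rough picture of the shape and the validation rules listed above, the sketch below mirrors the five fields and checks the two stated constraints (finite scores, non-empty failedJudges ids). The interface is reconstructed from this description, not copied from src/run-record.ts, and `checkJudgeScores` is a hypothetical stand-in rather than the actual validateRunRecord code; the judge id `judge-b` is also made up.

```typescript
// Reconstructed from the PR description; the real type lives in
// src/run-record.ts and may differ in detail.
interface JudgeScoresRecord {
  perJudge: Record<string, Record<string, number>>; // canonical store
  perDimMean: Record<string, number>;               // convenience projection
  composite: number;                                // score the gate sees
  failedJudges?: string[];                          // explicit dead-judge ids
  notes?: string;                                   // panel prose
}

// Hypothetical checker (NOT the library's validateRunRecord).
// Returns a list of human-readable problems; an empty list means valid.
function checkJudgeScores(js: JudgeScoresRecord): string[] {
  const problems: string[] = [];
  for (const [judgeId, dims] of Object.entries(js.perJudge)) {
    for (const [dim, score] of Object.entries(dims)) {
      if (!Number.isFinite(score)) {
        problems.push(`perJudge.${judgeId}.${dim} is not finite`);
      }
    }
  }
  for (const [dim, mean] of Object.entries(js.perDimMean)) {
    if (!Number.isFinite(mean)) problems.push(`perDimMean.${dim} is not finite`);
  }
  if (!Number.isFinite(js.composite)) problems.push("composite is not finite");
  for (const id of js.failedJudges ?? []) {
    if (typeof id !== "string" || id.length === 0) {
      problems.push("failedJudges entries must be non-empty strings");
    }
  }
  return problems;
}

// A partial panel: one judge scored, one is listed as failed.
const partial: JudgeScoresRecord = {
  perJudge: { kimi: { helpfulness: 0.5, clarity: 0.75, on_topic: 1 } },
  perDimMean: { helpfulness: 0.5, clarity: 0.75, on_topic: 1 },
  composite: 0.75,
  failedJudges: ["judge-b"],
};

// A record that tries to pass NaN through and lists an empty judge id.
const broken: JudgeScoresRecord = {
  ...partial,
  composite: Number.NaN,
  failedJudges: [""],
};

console.log(checkJudgeScores(partial), checkJudgeScores(broken));
```

Returning a list of problems instead of throwing on the first one is a choice made for this sketch only; the description does not say how validateRunRecord reports failures.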

Design tradeoffs

- perDimMean and composite are precomputed projections of
  perJudge. Storing both costs a few bytes per record but spares
  every reporter and IRR primitive a re-aggregation, a worthwhile
  trade for a read-heavy access pattern.
- failedJudges?: string[] is the typed-outcome answer to partial
  failures. A key missing from perJudge would be ambiguous on its
  own (silent zero vs. not run); the explicit list is fail-loud.
- The field is optional on RunOutcome so the 0.30.0 surface is
  preserved. Scalar-only runs leave it unset.
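The projection and fail-loud ideas can be sketched together. The PR does not state how composite is derived, so the code below assumes an unweighted mean of the per-dim means; `aggregatePanel`, `JudgeResult`, and the judge ids other than kimi are hypothetical names used only for illustration.

```typescript
type DimScores = Record<string, number>;

// Hypothetical per-judge result: either scores, or an error if the judge threw.
type JudgeResult =
  | { judgeId: string; scores: DimScores }
  | { judgeId: string; error: string };

// Sketch of building the record once at write time so readers never re-aggregate.
// ASSUMPTION: composite = unweighted mean of per-dim means (not confirmed by the PR).
function aggregatePanel(results: JudgeResult[]) {
  const perJudge: Record<string, DimScores> = {};
  const failedJudges: string[] = [];
  for (const r of results) {
    if ("scores" in r) perJudge[r.judgeId] = r.scores;
    else failedJudges.push(r.judgeId); // recorded by id, never scored as zero
  }

  // Accumulate per-dim totals over the judges that actually reported.
  const totals: Record<string, { sum: number; n: number }> = {};
  for (const dims of Object.values(perJudge)) {
    for (const [dim, score] of Object.entries(dims)) {
      const acc = totals[dim] ?? (totals[dim] = { sum: 0, n: 0 });
      acc.sum += score;
      acc.n += 1;
    }
  }

  const perDimMean: DimScores = {};
  for (const [dim, { sum, n }] of Object.entries(totals)) perDimMean[dim] = sum / n;

  const means = Object.values(perDimMean);
  if (means.length === 0) throw new Error("no judge produced scores");
  const composite = means.reduce((a, b) => a + b, 0) / means.length;

  return {
    perJudge,
    perDimMean,
    composite,
    ...(failedJudges.length > 0 ? { failedJudges } : {}),
  };
}

const record = aggregatePanel([
  { judgeId: "kimi", scores: { helpfulness: 0.5, clarity: 0.75 } },
  { judgeId: "judge-b", scores: { helpfulness: 1, clarity: 0.25 } },
  { judgeId: "judge-c", error: "timeout" },
]);

console.log(record);
```

Because the failed judge is excluded from the means and named in failedJudges, a reader can distinguish "scored zero" from "did not run" without inspecting which keys are absent from perJudge.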

What consumers gain

- Forge-chat (agent-builder) stops dropping per-dim scores;
  corpusInterRaterAgreement consumes the records directly.
- Tax / creative / legal / gtm agents inherit the same typed slot
  without each implementing their own conventions on outcome.raw.

Version bumps (lockstep)

package.json 0.30.0 → 0.31.0
clients/python/pyproject.toml 0.30.0 → 0.31.0
clients/python/src/agent_eval_rpc/__init__.py 0.30.0 → 0.31.0

Test plan

pnpm typecheck clean
pnpm test — 1208 tests pass (5 new in eval-campaign, 5 new in run-record, 1 new in consumer-contract)
pnpm build clean (tsup + openapi)
After merge: human tags v0.31.0 and pushes to fire the publish workflow.

Ensemble-judge consumers were dropping per-judge per-dim scores on the floor because RunOutcome only had a slot for the composite. Adds a typed `judgeScores?: JudgeScoresRecord` field, threaded through runEvalCampaign and pinned in the consumer-contract test. Validator rejects NaN scores and non-string failedJudges entries; fail-loud test covers a panel where one judge throws. Bumps TS + Python clients to 0.31.0 in lockstep.

tangletools merged commit 51f6e74 into main May 20, 2026
1 check passed

tangletools deleted the feat/0.31.0-judge-scores-record branch May 20, 2026 14:48

tangletools mentioned this pull request May 20, 2026

chore(0.31.1): republish — fix stale dist on v0.31.0 npm artifact #68

Merged
