
Replace model_evaluator.score_model() composite scoring with direct
AA benchmark scores from usecase_quality_scorer. The composite score
incorrectly favored smaller models due to latency/budget bonuses.

Changes:
- Get raw accuracy from score_model_quality() in capacity_planner
- GPT-OSS 120B now correctly shows ~62% (was showing lower)
- GPT-OSS 20B now correctly shows ~55% (was showing higher)

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Yuval Luria <yuval750@gmail.com>
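
For context, a minimal sketch of the effect described above. The bonus values and the `composite_score` / `quality_score` helpers are assumptions for illustration only, not the repository's actual `model_evaluator.score_model()` or `usecase_quality_scorer.score_model_quality()` implementations; the raw accuracies (~62% and ~55%) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    aa_accuracy: float    # raw AA benchmark accuracy (0-1), per the PR description
    latency_bonus: float  # assumed bonus terms, not the real formula
    budget_bonus: float


def composite_score(m: Model) -> float:
    # Old behaviour (assumed): accuracy plus latency/budget bonuses,
    # which skews the ranking toward smaller models.
    return m.aa_accuracy + m.latency_bonus + m.budget_bonus


def quality_score(m: Model) -> float:
    # New behaviour: raw benchmark accuracy only.
    return m.aa_accuracy


gpt_oss_120b = Model("GPT-OSS 120B", aa_accuracy=0.62, latency_bonus=0.02, budget_bonus=0.01)
gpt_oss_20b = Model("GPT-OSS 20B", aa_accuracy=0.55, latency_bonus=0.07, budget_bonus=0.05)

for m in (gpt_oss_120b, gpt_oss_20b):
    print(f"{m.name}: composite={composite_score(m):.2f}, raw accuracy={quality_score(m):.2f}")
```

With these assumed bonuses the 20B model out-scores the 120B model on the composite metric (0.67 vs 0.65) despite its lower raw accuracy (0.55 vs 0.62), which is the inversion the change removes.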