
Replace model_evaluator.score_model() composite scoring with direct
AA benchmark scores from usecase_quality_scorer. The composite score
incorrectly favored smaller models due to latency/budget bonuses.

Changes:
- Get raw accuracy from score_model_quality() in capacity_planner
- GPT-OSS 120B now correctly shows ~62% (was showing lower)
- GPT-OSS 20B now correctly shows ~55% (was showing higher)

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Yuval Luria <yuval750@gmail.com>
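
For context, a minimal sketch of the effect described above. The bonus values and the `composite_score` / `quality_score` helpers are assumptions for illustration only, not the repository's actual `model_evaluator.score_model()` or `usecase_quality_scorer.score_model_quality()` implementations; the raw accuracies (~62% and ~55%) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    aa_accuracy: float    # raw AA benchmark accuracy (0-1), per the PR description
    latency_bonus: float  # assumed bonus terms, not the real formula
    budget_bonus: float


def composite_score(m: Model) -> float:
    # Old behaviour (assumed): accuracy plus latency/budget bonuses,
    # which skews the ranking toward smaller models.
    return m.aa_accuracy + m.latency_bonus + m.budget_bonus


def quality_score(m: Model) -> float:
    # New behaviour: raw benchmark accuracy only.
    return m.aa_accuracy


gpt_oss_120b = Model("GPT-OSS 120B", aa_accuracy=0.62, latency_bonus=0.02, budget_bonus=0.01)
gpt_oss_20b = Model("GPT-OSS 20B", aa_accuracy=0.55, latency_bonus=0.07, budget_bonus=0.05)

for m in (gpt_oss_120b, gpt_oss_20b):
    print(f"{m.name}: composite={composite_score(m):.2f}, raw accuracy={quality_score(m):.2f}")
```

With these assumed bonuses the 20B model out-scores the 120B model on the composite metric (0.67 vs 0.65) despite its lower raw accuracy (0.55 vs 0.62), which is the inversion the change removes.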