Add objective Overall / Value / Capability score columns to the models table #1892

Open
huncho-tensei wants to merge 1 commit into anomalyco:dev from huncho-tensei:feat/objective-model-scores

@huncho-tensei

Summary

Adds three sortable, transparently-computed score columns to the models table — Overall, Value, and Capability — plus a dynamic rank (#) column that renumbers with the current sort.

The goal is to let people rank the whole catalog by objective criteria without leaving the table they already use. Scores are derived entirely from existing catalog fields — no benchmarks, no hand-grading, no external data.

New column layout:

#  |  Provider  |  Model  |  Overall  |  Value  |  Capability  |  … (all existing columns unchanged)

The table defaults to Overall, descending. All three score columns sort like every other column, so users can switch lens with one click.
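A minimal sketch of the default sort and the renumbering rank column, assuming each row already carries its three scores. The `Row` shape, the `rankBy` name, and the sample scores are hypothetical and chosen for illustration; the PR's render code may do this differently (for example, directly in the DOM).

```typescript
// Hypothetical row shape: only the fields needed for sorting and ranking.
interface Row {
  model: string;
  overall: number;
  value: number;
  capability: number;
}

type ScoreKey = "overall" | "value" | "capability";

// Sort a copy of the rows by the chosen column, then number them 1..n
// in display order, so the rank always reflects the current sort.
function rankBy(
  rows: Row[],
  key: ScoreKey = "overall",
  descending = true,
): Array<Row & { rank: number }> {
  const sorted = [...rows].sort((a, b) =>
    descending ? b[key] - a[key] : a[key] - b[key],
  );
  return sorted.map((row, i) => ({ ...row, rank: i + 1 }));
}

// Made-up scores, not taken from the catalog.
const sampleRows: Row[] = [
  { model: "a", overall: 70, value: 90, capability: 55 },
  { model: "b", overall: 85, value: 60, capability: 80 },
  { model: "c", overall: 60, value: 75, capability: 65 },
];

const byOverall = rankBy(sampleRows); // default: Overall, descending
const byValue = rankBy(sampleRows, "value"); // same rows, different lens
console.log(byOverall.map((r) => `${r.rank}. ${r.model}`)); // b, a, c
console.log(byValue.map((r) => `${r.rank}. ${r.model}`)); // a, c, b
```

Because the rank is derived from position after sorting, it never needs to be stored with the data.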

How the scores are computed

Four normalized 0–100 components are calculated from objective fields, then blended with weights kept in one documented place (packages/web/src/score.ts):

| Component | Source fields | Direction |
| --- | --- | --- |
| capability | tool_call, reasoning, structured_output, temperature + input/output modality breadth | higher = better |
| cost | blended cost.input + cost.output ($/1M), log-scaled then inverted | cheaper = better; free = top; unknown = neutral 50 |
| context | limit.context + limit.output, log-scaled | higher = better |
| recency | release_date | newer = better |

Each component is min-max normalized across the whole dataset (a missing/unparseable field collapses to a neutral 50, so it never silently wins or loses). The three lenses are just different weightings:
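The normalization and the cost component can be sketched as follows. This is an illustration under stated assumptions, not the contents of score.ts: the helper names are hypothetical, "blended" is taken to mean a plain sum of input and output price, and `Math.log1p` is one possible log scale (chosen here so that a free model maps cleanly to zero).

```typescript
const NEUTRAL = 50;

// Min-max normalize to 0-100 across the dataset.
// Missing values (null or non-finite) collapse to the neutral 50.
function normalize(values: Array<number | null>): number[] {
  const known = values.filter(
    (v): v is number => v !== null && Number.isFinite(v),
  );
  const min = Math.min(...known);
  const max = Math.max(...known);
  return values.map((v) => {
    if (v === null || !Number.isFinite(v)) return NEUTRAL;
    if (max === min) return NEUTRAL; // no spread, nothing to rank on
    return ((v - min) / (max - min)) * 100;
  });
}

// Assumed cost shape: $/1M tokens for input and output.
interface Cost {
  input: number;
  output: number;
}

// Log-scale the blended price, normalize, then invert so cheaper = higher.
// Unknown cost stays at the neutral 50 rather than being inverted.
function costScores(costs: Array<Cost | undefined>): number[] {
  const logged = costs.map((c) =>
    c === undefined ? null : Math.log1p(c.input + c.output),
  );
  return normalize(logged).map((s, i) =>
    logged[i] === null ? NEUTRAL : 100 - s,
  );
}

console.log(normalize([10, 20, 30, null])); // [0, 50, 100, 50]
console.log(costScores([{ input: 0, output: 0 }, { input: 3, output: 15 }, undefined]));
// [100, 0, 50]: free model on top, priciest at the bottom, unknown neutral
```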

| Lens | capability | cost | context | recency |
| --- | --- | --- | --- | --- |
| Overall | 0.40 | 0.30 | 0.20 | 0.10 |
| Value | 0.35 | 0.50 | 0.10 | 0.05 |
| Capability | 0.60 | 0.15 | 0.20 | 0.05 |

The weights are the only opinion in the change and are isolated in a single WEIGHTS object so they're trivial to audit or tune.
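As a sketch of what such a weights object and blend could look like: the numbers below are the ones from the lens table, but the type names, the lowercase lens keys, and the `blend` function are assumptions made for illustration rather than the actual code in score.ts.

```typescript
type Component = "capability" | "cost" | "context" | "recency";
type Lens = "overall" | "value" | "capability";

// Weights copied from the lens table; each row sums to 1.
const WEIGHTS: Record<Lens, Record<Component, number>> = {
  overall: { capability: 0.4, cost: 0.3, context: 0.2, recency: 0.1 },
  value: { capability: 0.35, cost: 0.5, context: 0.1, recency: 0.05 },
  capability: { capability: 0.6, cost: 0.15, context: 0.2, recency: 0.05 },
};

// Weighted sum of the four normalized 0-100 components for one lens.
function blend(lens: Lens, components: Record<Component, number>): number {
  const w = WEIGHTS[lens];
  return (Object.keys(w) as Component[]).reduce(
    (sum, key) => sum + w[key] * components[key],
    0,
  );
}

// Made-up component scores for a single model.
const example: Record<Component, number> = {
  capability: 80,
  cost: 90,
  context: 60,
  recency: 50,
};

console.log(blend("overall", example)); // about 76   (32 + 27 + 12 + 5)
console.log(blend("value", example)); // about 81.5 (28 + 45 + 6 + 2.5)
console.log(blend("capability", example)); // about 76   (48 + 13.5 + 12 + 2.5)
```

Since every lens's weights sum to 1 and every component is on a 0-100 scale, each blended score also stays within 0-100.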

What the score does — and does NOT — measure

This is important and stated up front: the catalog has no quality/benchmark field, so the score cannot and does not measure model "intelligence." "Capability" here means breadth of declared features and modalities, not how good a model's outputs are.
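To make "breadth of declared features and modalities" concrete, here is a hedged sketch of a raw capability count before normalization. The flag names come from the component table; the `modalities` shape, the equal weighting of flags and modalities, and the `capabilityRaw` name are assumptions, and the PR may combine them differently.

```typescript
// Assumed model shape: four boolean flags plus declared modalities.
interface ModelSpec {
  tool_call?: boolean;
  reasoning?: boolean;
  structured_output?: boolean;
  temperature?: boolean;
  modalities?: { input: string[]; output: string[] };
}

// Count declared features: one point per enabled flag and one per
// distinct input or output modality. Nothing here looks at output quality.
function capabilityRaw(m: ModelSpec): number {
  const flags = [m.tool_call, m.reasoning, m.structured_output, m.temperature];
  const flagCount = flags.filter((f) => f === true).length;
  const inputs = new Set(m.modalities?.input ?? []).size;
  const outputs = new Set(m.modalities?.output ?? []).size;
  return flagCount + inputs + outputs;
}

// A text-only model with three flags vs. an omni-modal one with all four.
const textOnly: ModelSpec = {
  tool_call: true,
  reasoning: false,
  structured_output: true,
  temperature: true,
  modalities: { input: ["text"], output: ["text"] },
};
const omni: ModelSpec = {
  tool_call: true,
  reasoning: true,
  structured_output: true,
  temperature: true,
  modalities: { input: ["text", "image", "audio"], output: ["text", "audio"] },
};

console.log(capabilityRaw(textOnly)); // 5
console.log(capabilityRaw(omni)); // 9
```

Under a count like this, a model that declares more modalities scores higher regardless of how well it performs, which is the limitation the paragraph above describes.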

A direct consequence, visible in the live data: broad, cheap, omni-modal models (and meta/auto-routers, which declare every modality at low listed cost) rank at the very top, above expensive frontier models. That is correct given these inputs — it's a spec-breadth-per-dollar ranking, not a smartness ranking.

This is also the reasoning behind shipping it as sortable columns rather than one decreed ranking: the data stays neutral, and the user chooses the lens that fits their use case. If a future schema ever adds an objective quality signal, it drops straight into the existing component blend.

Scope

  • Web package only. The canonical api.json / core data is unchanged — scoring is a presentation-layer concern and does not pollute the data API.
  • Files touched: score.ts (new), shared.ts, render.tsx, index.ts, index.css.

Validation

  • bun validate passes.
  • cd packages/web && bun run build succeeds; rendered HTML contains all new columns and computed scores across the dataset.
  • Sort invariant verified: 28 sortable headers ↔ 28 sortValues entries.
  • Score distribution sanity-checked (spans ~22–90; image/TTS models correctly rank lowest).
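The sort invariant from the list above could be checked with something like the following sketch. It assumes each row exposes an array of sort values, one per sortable header; that shape and the `checkSortInvariant` name are hypothetical and may not match how render.tsx stores them.

```typescript
type SortValue = string | number;

// Return a description of every row whose sort-value count does not
// match the number of sortable headers. An empty result means the
// invariant holds.
function checkSortInvariant(
  sortableHeaders: string[],
  rows: SortValue[][],
): string[] {
  const problems: string[] = [];
  rows.forEach((sortValues, i) => {
    if (sortValues.length !== sortableHeaders.length) {
      problems.push(
        `row ${i}: ${sortValues.length} sort values for ${sortableHeaders.length} headers`,
      );
    }
  });
  return problems;
}

// Illustrative data: the second row is missing one sort value.
const headers = ["Provider", "Model", "Overall"];
const sortRows: SortValue[][] = [
  ["acme", "model-a", 76],
  ["acme", "model-b"],
];
console.log(checkSortInvariant(headers, sortRows));
// ["row 1: 2 sort values for 3 headers"]
```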

Happy to adjust the weights, drop to two columns (Overall + Value), or gate this behind discussion if a built-in ranking isn't a direction you want — feedback welcome.

Adds three sortable, transparently-computed score columns to the models
table, plus a dynamic rank (#) column that renumbers with the current sort.

Scores are derived entirely from existing objective catalog fields (cost,
context window, output limit, capability flags, modality breadth, release
date) — no benchmarks or hand-grading. Four normalized 0-100 components
(capability, cost-efficiency, context, recency) are blended into three
lenses with weights documented in one place in score.ts:

  - Overall:    well-rounded "best overall"
  - Value:      cost-efficiency weighted (cheap-yet-capable)
  - Capability: feature/modality breadth weighted

The table defaults to Overall (descending). All three columns sort like
any other column, so users can pick the lens that fits their use case.

Scope is web-only; the canonical api.json data is unchanged.
@huncho-tensei
Author

Rationale and design discussion in #1893.
