feat(leaderboard): add (cov.) scoring-scope toggle by malteos · Pull Request #5 · commoncrawl/commonlid-eval

malteos · 2026-05-22T11:33:40Z

Summary

Add a radio button between the dataset metadata and the table on each tab. The default, "All samples", is the paper-headline view; (cov.) restricts the macro/micro/FPR/Languages columns to gold samples whose language is in the model's declared support set, mirroring the CommonLID paper's (cov.) column.
Persist supported_languages in summary.json (schema v3): a sorted ISO 639-3 list when the model can enumerate its languages, JSON null for LLM-style models whose support set is undefined.
New standalone script scripts/backfill_supported_languages.py walks existing summary files and writes the field in place (used to populate the HF dataset without re-running any evals; a companion PR will be opened on commoncrawl/commonlid-results).
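
The (cov.) scope can be illustrated with a small, self-contained sketch. This is not the repository's implementation: `macro_f1` and `cov_macro_f1` are hypothetical helpers, and the macro average here is taken over gold languages only, which may differ from the project's metric code.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Macro F1 averaged over the languages that occur in `gold`."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # missed the gold language
            fp[p] += 1  # wrongly predicted another language
    scores = []
    for lang in set(gold):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores.append(2 * tp[lang] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def cov_macro_f1(gold, pred, supported_languages):
    """Macro F1 restricted to samples whose gold language is supported."""
    if not supported_languages:  # None or []: no cov score available
        return None
    support = set(supported_languages)
    kept = [(g, p) for g, p in zip(gold, pred) if g in support]
    return macro_f1([g for g, _ in kept], [p for _, p in kept])


gold = ["eng", "eng", "deu", "swh", "swh"]
pred = ["eng", "deu", "deu", "eng", "eng"]
score_all = macro_f1(gold, pred)
score_cov = cov_macro_f1(gold, pred, ["eng", "deu"])
```

In this toy data the model never predicts `swh`, so dropping the two `swh` samples removes both a zero-scoring class and the false positives they caused for `eng`, and the cov. score comes out higher than the all-samples score.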

Cov view re-ranks the leaderboard because models that intentionally cover a subset are no longer penalised for the long tail:

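| model | macro F1 (all) | macro F1 (cov.) | cov. languages |
| --- | --- | --- | --- |
| cld2 | 49.5 | 79.3 | 68 |
| fasttext | 49.3 | 72.6 | 74 |
| GlotLID | 60.4 | 68.6 | 96 |
| OpenLID-v2 | 47.4 | 68.0 | 76 |
| cld3 | 34.3 | 66.6 | 56 |
| commonlingua | 57.3 | 66.4 | 94 |
| funlangid | 46.4 | 58.1 | 87 |
| pyfranc | 39.3 | 57.2 | 58 |
| AfroLID | 9.2 | 43.5 | 23 |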
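
To make the re-ranking concrete, the following sketch sorts the rows of the table above by each column and reports how far each model moves. The numbers are copied from the table; the variable names are only illustrative.

```python
# (model, macro F1 over all samples, macro F1 in the cov. view)
rows = [
    ("cld2", 49.5, 79.3),
    ("fasttext", 49.3, 72.6),
    ("GlotLID", 60.4, 68.6),
    ("OpenLID-v2", 47.4, 68.0),
    ("cld3", 34.3, 66.6),
    ("commonlingua", 57.3, 66.4),
    ("funlangid", 46.4, 58.1),
    ("pyfranc", 39.3, 57.2),
    ("AfroLID", 9.2, 43.5),
]

by_all = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
by_cov = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]

# Positive shift: the model climbs when switching from "all" to "cov.".
rank_shift = {name: by_all.index(name) - by_cov.index(name) for name in by_all}

for name in by_cov:
    print(f"{name:13s} all #{by_all.index(name) + 1}  cov #{by_cov.index(name) + 1}")
```

GlotLID leads the all-samples ranking, while cld2 leads the cov. ranking and cld3 climbs three places.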

supported_languages follows the same tri-state convention as LIDModel.discover_supported_languages(): None (undefined / LLM), [] (declared zero), or list[str]. The leaderboard collapses all three "no cov data" cases to em-dashes in the cov view; the persisted distinction between null and missing is preserved on disk.
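
A minimal sketch of how the cov. view could collapse these cases, assuming the three "no cov data" cases are a missing key, JSON null, and an empty list. The helper names and the `cov_f1` row key are hypothetical, not the leaderboard's actual code.

```python
EM_DASH = "\u2014"  # placeholder character shown in the cov. view


def has_cov_data(summary):
    """True only when a non-empty support set is present."""
    # .get() maps a missing key to None, so missing, null, and [] all fail here.
    return bool(summary.get("supported_languages"))


def cov_cell(summary, cov_value):
    """Format one cov-view cell, falling back to the placeholder."""
    if not has_cov_data(summary) or cov_value is None:
        return EM_DASH
    return f"{cov_value:.1f}"


def sort_cov_view(rows):
    """Sort by cov score descending, with rows lacking a score last."""
    return sorted(rows, key=lambda r: (r["cov_f1"] is None, -(r["cov_f1"] or 0.0)))


rows = [
    {"model": "llm", "cov_f1": None},
    {"model": "a", "cov_f1": 66.6},
    {"model": "b", "cov_f1": 79.3},
]
ordered = [r["model"] for r in sort_cov_view(rows)]
```

Collapsing happens only at render time in this sketch; the summary dictionaries themselves are left untouched, so the on-disk distinction between null and a missing field survives.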

Coverage counts vs the paper

We compared the supported_languages lengths produced by discover_supported_languages() against both Table 1 (tab:eval_models) and the mutual-coverage diagonal in tab:coverage. Note the paper's two tables disagree with each other for several models, because Table 1 reports the raw upstream label count while the coverage diagonal reports the dedup'd / ISO-conformed count actually used in evaluation.

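| Model | Paper Table 1 | Coverage diag (`tab:coverage`) | Ours | Δ vs Table 1 | Δ vs coverage |
| --- | --- | --- | --- | --- | --- |
| GlotLID | 1868 | 1868 | 1868 | 0 | 0 |
| OpenLID-v2 | 193 | 193 | 193 | 0 | 0 |
| AfroLID | 517 | 515 | 516 | −1 | +1 |
| cld3 | 99 | 99 | 101 | +2 | +2 |
| fasttext | 218 | 210 | 210 | −8 | 0 |
| FUN-LangID | 1634 | 1549 | 1552 | −82 | +3 |
| CLD2 | 158 | 158 | 172 | +14 | +14 |
| pyFranc | 414 | 410 | 379 | −35 | −31 |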
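
The Δ columns are plain differences (ours minus the paper's figure). A short check, with the counts copied from the table above, recomputes them and lists the models that agree with the coverage diagonal within ±3:

```python
# model: (paper Table 1, coverage diagonal, ours)
counts = {
    "GlotLID": (1868, 1868, 1868),
    "OpenLID-v2": (193, 193, 193),
    "AfroLID": (517, 515, 516),
    "cld3": (99, 99, 101),
    "fasttext": (218, 210, 210),
    "FUN-LangID": (1634, 1549, 1552),
    "CLD2": (158, 158, 172),
    "pyFranc": (414, 410, 379),
}

# (delta vs Table 1, delta vs coverage diagonal)
deltas = {m: (ours - t1, ours - diag) for m, (t1, diag, ours) in counts.items()}

within_3 = {m for m, (_, d_cov) in deltas.items() if abs(d_cov) <= 3}

for model, (d_t1, d_cov) in deltas.items():
    print(f"{model:11s} {d_t1:+4d} {d_cov:+4d}")
```

Only CLD2 and pyFranc fall outside the ±3 band, which is why they get their own diagnosis.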

Per-model diagnosis:

GlotLID / OpenLID-v2: exact match.
AfroLID / fasttext / FUN-LangID / cld3: ours equals the coverage diagonal within ±3. Our raw label counts are 518 / 218 / 1634 / 107, which become 516 / 210 / 1552 / 101 after _conform(). For fasttext and FUN-LangID, Table 1 reports exactly that raw count (218, 1634), while the coverage diagonal does what we do.
CLD2 (+14): pycld2 ships 303 raw labels; after dropping xx-Script sentinels (xx-Cyrl, xx-Hani, …) and BCP-47 region variants (sr-ME) we get 172 unique ISO 639-3 codes. The paper's 158 is more conservative — it likely filtered some BCP-47 / script tags whose _conform() output is a real ISO 639-3 code (e.g. zh-Hant → zho).
pyFranc (−31): upstream pyfranc.franc.data currently exposes 380 raw codes; we drop only the 1 duplicate (dan collision), leaving 379. The paper's 410 / 414 reflects an older snapshot of franc: the library shrank between the paper's submission and now. Not actionable on our side without pinning an older pyfranc.
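
The kind of filtering described for CLD2 can be sketched as follows. This is an assumption-laden illustration, not the project's `_conform()`: `conform_label`, `supported_set`, and the tiny `ISO_639_1_TO_3` table are hypothetical, and three-letter primary subtags are passed through without validation.

```python
# Tiny illustrative subset; a real mapping would cover all ISO 639-1 codes.
ISO_639_1_TO_3 = {"en": "eng", "sr": "srp", "zh": "zho", "da": "dan"}


def conform_label(raw):
    """Map one raw label to an ISO 639-3 code, or None to drop it."""
    primary = raw.split("-")[0].lower()  # strip script/region subtags
    if primary == "xx":  # script-only sentinel such as xx-Cyrl
        return None
    if len(primary) == 3:  # assumed to already be ISO 639-3
        return primary
    return ISO_639_1_TO_3.get(primary)  # unknown codes are dropped


def supported_set(raw_labels):
    """Deduplicated, sorted ISO 639-3 codes for a list of raw labels."""
    return sorted({c for c in map(conform_label, raw_labels) if c is not None})


raw = ["en", "sr", "sr-ME", "zh", "zh-Hant", "xx-Cyrl", "xx-Hani", "dan", "da"]
codes = supported_set(raw)
```

Nine raw labels collapse to four codes here: the sentinels vanish, `sr-ME` and `zh-Hant` merge into `srp` and `zho`, and `dan` / `da` collide into a single entry. Whether a tag like `zh-Hant` is merged or discarded is exactly the kind of choice that could explain a gap such as 172 versus 158.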

Recommendation: leave the numbers as-is. discover_supported_languages() is internally consistent (every model goes through the same _conform() filter) and reflects the currently-installed library version — so the cov metric is computed against what each model can actually predict today, not against a paper-table snapshot. Overriding to literally match the paper would mean reporting a coverage set that diverges from what _conform()'d predictions can hit at eval time.

Test plan

make lint && make typecheck — clean
make test — 247 passed, 94.56% coverage. New tests cover _row_from_summary cov math, the three "no cov data" cases, the radio change handler, and Result.summary() round-tripping None / [] / list.
Backfill local data/results/, restart make leaderboard, and switch the radio: the table and legend swap in lockstep, and GPT-* rows show em-dashes and sort to the bottom in the cov view.
Follow-up: open the matching PR on commoncrawl/commonlid-results so the deployed Space picks up supported_languages on next restart.
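
The round-trip property mentioned in the test plan can be checked with the standard `json` module alone. This sketch does not use `Result.summary()`; `round_trip` is a hypothetical helper that only shows that None, [], and a list survive serialization as distinct values.

```python
import json


def round_trip(value):
    """Serialize a supported_languages value to JSON and read it back."""
    text = json.dumps({"supported_languages": value})
    return json.loads(text)["supported_languages"]


undefined = round_trip(None)            # stored as JSON null
declared_zero = round_trip([])          # stored as an empty array
enumerated = round_trip(["deu", "eng"])

# A file written before schema v3 has no key at all, which is a fourth,
# distinguishable state on disk.
legacy = json.loads("{}")
is_missing = "supported_languages" not in legacy
```

Python's `None` maps to JSON `null` and back, so the undefined state never gets confused with an explicitly empty list.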

Add a radio between the dataset metadata and the results table per tab. "All samples" (default) preserves the current paper-headline view; "(cov.)" restricts the macro/micro/FPR/Languages columns to gold samples whose language is in the model's declared support set, matching the CommonLID paper's `(cov.)` column.

Mechanics: persist `supported_languages` in `summary.json` (schema v3) as a sorted ISO 639-3 list when the model can enumerate, or JSON `null` for LLM-style models whose support set is undefined. The leaderboard data layer reuses `mean_stats_with_coverage` + `mean_false_positive_rate` to compute the cov fields per row; rows without a support set render em-dashes and sort to the bottom of the cov view.

Backfill existing summary files via `scripts/backfill_supported_languages.py` (standalone, argparse) and publish them to the HF dataset to keep the deployed Space in sync.
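
A rough sketch of what an in-place backfill along these lines might look like. It is not the contents of `scripts/backfill_supported_languages.py`: the `model` key, the `KNOWN_SUPPORT` lookup (a stand-in for calling `discover_supported_languages()` per model), and the function names are all assumptions.

```python
import argparse
import json
from pathlib import Path

# Hypothetical stand-in for a per-model discover_supported_languages() call.
KNOWN_SUPPORT = {"toy-model": ["eng", "deu"]}


def backfill_file(path, support_by_model):
    """Add supported_languages to one summary file; return True if changed."""
    summary = json.loads(path.read_text(encoding="utf-8"))
    if "supported_languages" in summary:
        return False  # already present, including an explicit null
    langs = support_by_model.get(summary.get("model"))
    summary["supported_languages"] = None if langs is None else sorted(langs)
    path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
    return True


def backfill_tree(root, support_by_model):
    """Walk `root` recursively and backfill every summary.json found."""
    paths = sorted(Path(root).rglob("summary.json"))
    return sum(backfill_file(p, support_by_model) for p in paths)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Backfill supported_languages.")
    parser.add_argument("results_dir", type=Path)
    args = parser.parse_args(argv)
    changed = backfill_tree(args.results_dir, KNOWN_SUPPORT)
    print(f"updated {changed} summary file(s)")
    return 0
```

Calling `main(["data/results"])` would process a local results directory. Files that already carry the field are skipped, so running the sketch twice leaves the second pass with nothing to change.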

malteos wants to merge 1 commit into main from feat/leaderboard-cov
