feat(leaderboard): add (cov.) scoring-scope toggle#5

Open
malteos wants to merge 1 commit into main from feat/leaderboard-cov

malteos (Collaborator) commented May 22, 2026

Summary

  • Add a radio toggle between the dataset metadata and the results table on each tab. The default is the paper-headline view; (cov.) restricts the macro/micro/FPR/Languages columns to gold samples whose language is in the model's declared support set, mirroring the CommonLID paper's (cov.) column.
  • Persist supported_languages in summary.json (schema v3): a sorted ISO 639-3 list when the model can enumerate its languages, JSON null for LLM-style models whose support set is undefined.
  • New standalone script scripts/backfill_supported_languages.py walks existing summary files and writes the field in place (used to populate the HF dataset without re-running any evals; a companion PR will be opened on commoncrawl/commonlid-results).
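
To make the in-place backfill concrete, here is a minimal sketch of the idea. It is not the contents of scripts/backfill_supported_languages.py: the function names, the `model` key used to look up each summary's model, and the `support_by_model` mapping are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def backfill_summary(path: Path, supported: Optional[list[str]]) -> bool:
    """Write `supported_languages` into one summary file, in place.

    Returns True if the file was changed. A key that is already present
    (even with a null value) is left untouched.
    """
    summary = json.loads(path.read_text(encoding="utf-8"))
    if "supported_languages" in summary:
        return False
    # None serialises to JSON null (undefined support set, e.g. an LLM);
    # a list is stored sorted and de-duplicated.
    summary["supported_languages"] = (
        None if supported is None else sorted(set(supported))
    )
    path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
    return True


def backfill_tree(root: Path, support_by_model: dict[str, Optional[list[str]]]) -> int:
    """Backfill every summary.json under `root`; return how many files changed."""
    changed = 0
    for path in sorted(root.rglob("summary.json")):
        # "model" is a hypothetical key name for the model identifier.
        model = json.loads(path.read_text(encoding="utf-8")).get("model")
        if model in support_by_model and backfill_summary(path, support_by_model[model]):
            changed += 1
    return changed
```

In this sketch a model that is absent from `support_by_model` is skipped, so its summary keeps no `supported_languages` key at all, which stays distinguishable on disk from an explicit null.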

The cov view re-ranks the leaderboard because models that intentionally cover a subset of languages are no longer penalised for the long tail they do not support:

| model | macro F1 (all) | macro F1 (cov.) | cov. languages |
|---|---|---|---|
| cld2 | 49.5 | 79.3 | 68 |
| fasttext | 49.3 | 72.6 | 74 |
| GlotLID | 60.4 | 68.6 | 96 |
| OpenLID-v2 | 47.4 | 68.0 | 76 |
| cld3 | 34.3 | 66.6 | 56 |
| commonlingua | 57.3 | 66.4 | 94 |
| funlangid | 46.4 | 58.1 | 87 |
| pyfranc | 39.3 | 57.2 | 58 |
| AfroLID | 9.2 | 43.5 | 23 |
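
The effect of the scope restriction can be illustrated with a small, self-contained sketch. This is not the repo's `mean_stats_with_coverage`; the function name `macro_f1` and the choice to average over the gold languages that remain after filtering are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Optional


def macro_f1(
    gold: list[str],
    pred: list[str],
    supported: Optional[Iterable[str]] = None,
) -> Optional[float]:
    """Macro-averaged F1 over gold languages, optionally coverage-restricted.

    With `supported` given, only samples whose gold language is in the
    support set are scored. Returns None when nothing is left to score.
    """
    pairs = list(zip(gold, pred))
    if supported is not None:
        keep = set(supported)
        pairs = [(g, p) for g, p in pairs if g in keep]
    if not pairs:
        return None

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in pairs:
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the gold language was missed
            fp[p] += 1  # the predicted language was wrong

    scores = []
    for lang in sorted({g for g, _ in pairs}):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores.append(2 * tp[lang] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["eng", "eng", "fra", "swa", "swa"]
pred = ["eng", "eng", "fra", "eng", "fra"]
all_view = macro_f1(gold, pred)                  # swa drags the average down
cov_view = macro_f1(gold, pred, {"eng", "fra"})  # swa samples are excluded
```

Here a toy model that never predicts `swa` scores about 0.489 over all samples but 1.0 once the unsupported `swa` samples are excluded, which is the same mechanism behind the re-ranking in the table.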

supported_languages follows the same tri-state convention as LIDModel.discover_supported_languages(): None (undefined / LLM), [] (declared zero), or list[str]. The leaderboard collapses all three "no cov data" cases (missing field, null, and []) to em-dashes in the cov view; the distinction between null and a missing field is preserved on disk.
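
A minimal sketch of how the three "no cov data" cases could collapse at render time while staying distinct on disk. The row layout and the names `has_cov_data`, `cov_cell`, `sort_cov_rows`, and `cov_macro_f1` are hypothetical and do not come from the leaderboard code.

```python
from typing import Optional

EM_DASH = "\u2014"  # placeholder shown when a row has no cov data


def has_cov_data(row: dict) -> bool:
    """True only when the row declares a non-empty support set."""
    # .get() maps a missing key to None, so a missing field, null, and []
    # are all falsy here and collapse to the same "no cov data" answer.
    return bool(row.get("supported_languages"))


def cov_cell(row: dict) -> str:
    """Render the cov macro F1 cell for one leaderboard row."""
    score: Optional[float] = row.get("cov_macro_f1")
    if not has_cov_data(row) or score is None:
        return EM_DASH
    return f"{score:.1f}"


def sort_cov_rows(rows: list[dict]) -> list[dict]:
    """Sort by cov macro F1 descending; rows without cov data go last."""
    def key(row: dict) -> tuple[bool, float]:
        score = row.get("cov_macro_f1")
        missing = not has_cov_data(row) or score is None
        return (missing, 0.0 if missing else -score)

    # sorted() is stable, so rows without cov data keep their input order.
    return sorted(rows, key=key)


rows = [
    {"model": "gpt", "supported_languages": None},
    {"model": "cld3", "supported_languages": ["eng"], "cov_macro_f1": 66.6},
    {"model": "legacy"},  # field missing entirely
    {"model": "cld2", "supported_languages": ["eng"], "cov_macro_f1": 79.3},
    {"model": "zero", "supported_languages": []},
]
ordered = [row["model"] for row in sort_cov_rows(rows)]
```

The input dictionaries are never rewritten, so a loader that needs to tell null from a missing field can still check `"supported_languages" in row` directly.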

Coverage counts vs the paper

We compared the supported_languages lengths produced by discover_supported_languages() against both Table 1 (tab:eval_models) and the mutual-coverage diagonal in tab:coverage. Note that the paper's two tables disagree with each other for several models, because Table 1 reports the raw upstream label count while the coverage diagonal reports the deduplicated, ISO-conformed count actually used in evaluation.

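| Model | Paper Table 1 | Coverage diag (tab:coverage) | Ours | Δ vs Table 1 | Δ vs coverage |
|---|---|---|---|---|---|
| GlotLID | 1868 | 1868 | 1868 | 0 | 0 |
| OpenLID-v2 | 193 | 193 | 193 | 0 | 0 |
| AfroLID | 517 | 515 | 516 | −1 | +1 |
| cld3 | 99 | 99 | 101 | +2 | +2 |
| fasttext | 218 | 210 | 210 | −8 | 0 |
| FUN-LangID | 1634 | 1549 | 1552 | −82 | +3 |
| CLD2 | 158 | 158 | 172 | +14 | +14 |
| pyFranc | 414 | 410 | 379 | −35 | −31 |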

Per-model diagnosis:

  • GlotLID / OpenLID-v2: exact match.
  • AfroLID / fasttext / FUN-LangID / cld3: ours matches the coverage diagonal to within ±3. Table 1 reports the raw label count (518, 218, 1634, 107 raw → 516 / 210 / 1552 / 101 after _conform()); the coverage diagonal applies the same deduplication and ISO conforming that we do.
  • CLD2 (+14): pycld2 ships 303 raw labels; after dropping xx-Script sentinels (xx-Cyrl, xx-Hani, …) and BCP-47 region variants (sr-ME) we get 172 unique ISO 639-3 codes. The paper's 158 is more conservative: it likely filtered some BCP-47 / script tags whose _conform() output is a real ISO 639-3 code (e.g. zh-Hant → zho).
  • pyFranc (−31): upstream pyfranc.franc.data currently exposes 380 unique raw codes; we drop only the 1 duplicate (dan collision), leaving 379. The paper's 410 / 414 reflects an older snapshot of franc; the library shrank between the paper's submission and now. Not actionable on our side without pinning an older pyfranc.
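
As an illustration of why raw label counts and conformed counts diverge, the sketch below normalises a handful of raw labels and de-duplicates the result. It is not the repo's `_conform()`: the function names, the tiny ISO 639-1 lookup table, and the choice to fold region and script variants into their base language are all assumptions.

```python
from typing import Optional

# Tiny illustrative ISO 639-1 -> ISO 639-3 table; a real mapping is far larger.
ISO1_TO_ISO3 = {"da": "dan", "en": "eng", "sr": "srp", "zh": "zho"}


def conform_label(raw: str) -> Optional[str]:
    """Map one raw label to an ISO 639-3 code, or None if it is dropped."""
    primary = raw.split("-")[0].lower()
    if primary == "xx":
        return None  # script-only sentinel such as xx-Cyrl
    if len(primary) == 3 and primary.isalpha():
        return primary  # assumed to be ISO 639-3 already (not validated here)
    return ISO1_TO_ISO3.get(primary)  # unknown two-letter codes are dropped


def conformed_support(raw_labels: list[str]) -> list[str]:
    """Sorted, de-duplicated ISO 639-3 codes for a list of raw labels."""
    codes = {conform_label(raw) for raw in raw_labels}
    codes.discard(None)
    return sorted(codes)


raw = ["en", "sr", "sr-ME", "xx-Cyrl", "zh", "zh-Hant", "dan", "da"]
support = conformed_support(raw)
```

In this toy input, eight raw labels collapse to four unique codes (`dan`, `eng`, `srp`, `zho`): the sentinel is dropped, and `sr-ME`, `zh-Hant`, and `da` each collide with a code that is already present. A filter that differs in any of these rules will produce a different total from the same upstream labels.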

Recommendation: leave the numbers as-is. discover_supported_languages() is internally consistent (every model goes through the same _conform() filter) and reflects the currently installed library version, so the cov metric is computed against what each model can actually predict today, not against a paper-table snapshot. Overriding the sets to literally match the paper would mean reporting a coverage set that diverges from what predictions passed through _conform() can hit at eval time.

Test plan

  • make lint && make typecheck — clean
  • make test — 247 passed, 94.56% coverage. New tests cover _row_from_summary cov math, the three "no cov data" cases, the radio change handler, and Result.summary() round-tripping None / [] / list.
  • Backfill local data/results/, restart make leaderboard, switch the radio: table + legend swap in lockstep, em-dashes for GPT-* rows, sorted-to-bottom in cov view.
  • Follow-up: open the matching PR on commoncrawl/commonlid-results so the deployed Space picks up supported_languages on next restart.

Add a radio between the dataset metadata and the results table per tab.
"All samples" (default) preserves the current paper-headline view; "(cov.)"
restricts the macro/micro/FPR/Languages columns to gold samples whose
language is in the model's declared support set, matching the CommonLID
paper's `(cov.)` column.

Mechanics: persist `supported_languages` in `summary.json` (schema v3) —
sorted ISO 639-3 list when the model can enumerate, JSON `null` for
LLM-style models whose support set is undefined. The leaderboard data
layer reuses `mean_stats_with_coverage` + `mean_false_positive_rate` to
compute the cov fields per row; rows without a support set render
em-dashes and sort to the bottom of the cov view.

Backfill existing summary files via `scripts/backfill_supported_languages.py`
(standalone, argparse) and publish them to the HF dataset to keep the
deployed Space in sync.