feat(leaderboard): add (cov.) scoring-scope toggle #5
Open
malteos wants to merge 1 commit into
Add a radio between the dataset metadata and the results table per tab. "All samples" (default) preserves the current paper-headline view; "(cov.)" restricts the macro/micro/FPR/Languages columns to gold samples whose language is in the model's declared support set, matching the CommonLID paper's `(cov.)` column.

Mechanics: persist `supported_languages` in `summary.json` (schema v3) as a sorted ISO 639-3 list when the model can enumerate its languages, or as JSON `null` for LLM-style models whose support set is undefined. The leaderboard data layer reuses `mean_stats_with_coverage` + `mean_false_positive_rate` to compute the cov fields per row; rows without a support set render em-dashes and sort to the bottom of the cov view.

Backfill existing summary files via `scripts/backfill_supported_languages.py` (standalone, argparse) and publish them to the HF dataset to keep the deployed Space in sync.
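The backfill step could look roughly like the sketch below. It is not the contents of `scripts/backfill_supported_languages.py`: the `SUPPORT_SETS` lookup, the `model` key, and the directory layout are assumptions made for illustration, and a real script would presumably ask each model for its support set instead of using a hard-coded table.

```python
import argparse
import json
from pathlib import Path
from typing import Optional

# Hypothetical lookup: model name -> declared support set.
# None stands for "undefined" (LLM-style model).
SUPPORT_SETS: dict[str, Optional[list[str]]] = {
    "toy-lid": ["eng", "deu", "fra"],
    "toy-llm": None,
}


def backfill_summary(path: Path, support_sets: dict[str, Optional[list[str]]]) -> bool:
    """Add `supported_languages` to one summary file; return True if it was rewritten."""
    summary = json.loads(path.read_text(encoding="utf-8"))
    if "supported_languages" in summary:
        return False  # already has the field (even if null): leave it alone
    model = summary.get("model")  # assumed key name
    if model not in support_sets:
        return False  # unknown model: do not guess a support set
    langs = support_sets[model]
    # Sorted, de-duplicated list, or JSON null when the set is undefined.
    summary["supported_languages"] = None if langs is None else sorted(set(langs))
    path.write_text(json.dumps(summary, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return True


def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Backfill supported_languages (sketch).")
    parser.add_argument("results_dir", type=Path, help="directory containing summary.json files")
    args = parser.parse_args(argv)
    paths = sorted(args.results_dir.rglob("summary.json"))
    changed = sum(backfill_summary(p, SUPPORT_SETS) for p in paths)
    print(f"updated {changed} of {len(paths)} file(s)")
    return 0
```

Calling `main(["data/results"])` would rewrite files in place; skipping files that already carry the field keeps the sketch safe to re-run.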
Summary
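- Add a scoring-scope radio per tab: "All samples" (default) or "(cov.)", which restricts the macro/micro/FPR/Languages columns to gold samples whose language is in the model's declared support set, matching the CommonLID paper's `(cov.)` column.
- `supported_languages` in `summary.json` (schema v3): sorted ISO 639-3 list when the model can enumerate, JSON `null` for LLM-style models whose support set is undefined.
- `scripts/backfill_supported_languages.py` walks existing summary files and writes the field in place (used to populate the HF dataset without re-running any evals; companion PR opens on `commoncrawl/commonlid-results`).

The cov view re-ranks the leaderboard because models that intentionally cover a subset are no longer penalised for the long tail.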
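To illustrate why the ranking changes, here is a minimal sketch using toy data. Plain accuracy stands in for the macro/micro/FPR metrics, and none of the names below come from the leaderboard code; the models, support sets, and predictions are invented for the example.

```python
from typing import Optional

# (gold language, predicted language), both as ISO 639-3 codes
Sample = tuple[str, str]


def accuracy(samples: list[Sample]) -> Optional[float]:
    """Plain accuracy; None when there is nothing to score."""
    if not samples:
        return None
    return sum(gold == pred for gold, pred in samples) / len(samples)


def cov_accuracy(samples: list[Sample], supported: Optional[list[str]]) -> Optional[float]:
    """Accuracy restricted to gold languages in the declared support set."""
    if not supported:  # None (undefined) and [] (declared zero) both mean "no cov data"
        return None
    allowed = set(supported)
    return accuracy([s for s in samples if s[0] in allowed])


def rank(rows: list[dict], key: str) -> list[str]:
    """Model names by descending score; rows without a score sort last."""
    ordered = sorted(rows, key=lambda r: (r[key] is None, -(r[key] or 0.0)))
    return [r["model"] for r in ordered]


def fmt(score: Optional[float]) -> str:
    """Render a missing score as an em-dash, as the cov view does."""
    return "\u2014" if score is None else f"{score:.2f}"


GOLD = ["eng", "deu", "zul", "xho"]
# name -> (declared support set, predictions aligned with GOLD); toy values
MODELS = {
    "narrow": (["eng", "deu"], ["eng", "deu", "eng", "eng"]),
    "broad": (["eng", "deu", "zul", "xho"], ["eng", "deu", "zul", "eng"]),
    "llm": (None, ["eng", "deu", "zul", "xho"]),
}

rows = []
for name, (supported, preds) in MODELS.items():
    samples = list(zip(GOLD, preds))
    rows.append({"model": name, "all": accuracy(samples), "cov": cov_accuracy(samples, supported)})

for row in rows:
    print(row["model"], fmt(row["all"]), fmt(row["cov"]))
print("all samples:", rank(rows, "all"))
print("cov view:   ", rank(rows, "cov"))
```

In this toy setup the `narrow` model is last on all samples (0.50) but first in the cov view (1.00), while the `llm` row has no support set, so it renders an em-dash and drops to the bottom.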
`supported_languages` follows the same tri-state convention as `LIDModel.discover_supported_languages()`: `None` (undefined / LLM), `[]` (declared zero), or `list[str]`. The leaderboard collapses all three "no cov data" cases to em-dashes in the cov view; the persisted distinction between `null` and missing is preserved on disk.

Coverage counts vs the paper
We compared the `supported_languages` lengths produced by `discover_supported_languages()` against both Table 1 (`tab:eval_models`) and the mutual-coverage diagonal in `tab:coverage`. Note that the paper's two tables disagree with each other for several models, because Table 1 reports the raw upstream label count while the coverage diagonal reports the deduplicated, ISO-conformed count actually used in evaluation.

Per-model diagnosis:
- […] `_conform()`), the coverage diagonal does what we do.
- […] `xx-Script` sentinels (`xx-Cyrl`, `xx-Hani`, …) and BCP-47 region variants (`sr-ME`) we get 172 unique ISO 639-3 codes. The paper's 158 is more conservative: it likely filtered some BCP-47 / script tags whose `_conform()` output is a real ISO 639-3 code (e.g. `zh-Hant` → `zho`).
- `pyfranc.franc.data` currently exposes 380 unique raw codes; we drop only the 1 duplicate (`dan` collision). The paper's 410 / 414 reflects an older snapshot of franc; the library shrank between the paper's submission and now. Not actionable on our side without pinning an older `pyfranc`.

Recommendation: leave the numbers as-is.
`discover_supported_languages()` is internally consistent (every model goes through the same `_conform()` filter) and reflects the currently installed library version, so the cov metric is computed against what each model can actually predict today, not against a paper-table snapshot. Overriding to literally match the paper would mean reporting a coverage set that diverges from what `_conform()`'d predictions can hit at eval time.

Test plan
- `make lint && make typecheck`: clean
- `make test`: 247 passed, 94.56% coverage. New tests cover `_row_from_summary` cov math, the three "no cov data" cases, the radio change handler, and `Result.summary()` round-tripping `None` / `[]` / `list`.
- […] `data/results/`, restart `make leaderboard`, switch the radio: table + legend swap in lockstep, em-dashes for GPT-* rows, sorted-to-bottom in cov view.
- […] `commoncrawl/commonlid-results` so the deployed Space picks up `supported_languages` on next restart.
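The `None` / `[]` / `list` round-trip mentioned in the test plan can be illustrated with a small standalone sketch. It does not call `Result.summary()`; `dump_summary`, `load_supported`, `has_cov_data`, and the `schema_version` / `model` keys are hypothetical names used only to show how the three states survive JSON serialization.

```python
import json
from typing import Optional


def dump_summary(model: str, supported: Optional[list[str]]) -> str:
    """Serialize a minimal summary; key names other than supported_languages are assumptions."""
    payload = {
        "schema_version": 3,
        "model": model,
        # None becomes JSON null; lists are stored sorted.
        "supported_languages": None if supported is None else sorted(supported),
    }
    return json.dumps(payload)


def load_supported(text: str) -> Optional[list[str]]:
    """Read the field back; a missing key (older file) also yields None in memory."""
    return json.loads(text).get("supported_languages")


def has_cov_data(supported: Optional[list[str]]) -> bool:
    """None, missing, and [] all count as 'no cov data'."""
    return bool(supported)


for value in (None, [], ["fra", "eng"]):
    restored = load_supported(dump_summary("toy", value))
    print(repr(value), "->", repr(restored), "| cov data:", has_cov_data(restored))

# A file written before the field existed: the key is absent on disk,
# which stays distinguishable from an explicit null.
legacy = json.dumps({"model": "toy"})
print("missing key on disk:", "supported_languages" not in json.loads(legacy))
```

Using `bool(supported)` as the single "has cov data" check is one simple way to collapse the three cases for display while leaving the on-disk representation untouched.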