Add descriptive dataset statistics plots by yananlong · Pull Request #126 · evaleval/every_eval_ever

yananlong · 2026-04-29T17:14:38Z

Summary

add descriptive dataset statistics helpers for model coverage by benchmark and inference-engine/runtime spread
add PDF plots for model-per-dataset coverage and inference-engine spread
cover the new descriptive report fields with focused tests
keep the plotting script focused on PDF output only
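
As a rough illustration of the two descriptive statistics named in the summary, the sketch below counts distinct models per benchmark and rows per inference engine. It is not the PR's code: the function names and the row keys model and inference_engine are assumptions made for this example, and the sample rows are invented. Only the benchmark key appears in the diff under review.

```python
from collections import Counter, defaultdict


def model_coverage_by_benchmark(rows):
    """Count distinct models evaluated on each benchmark (hypothetical helper)."""
    models = defaultdict(set)
    for row in rows:
        models[row["benchmark"]].add(row["model"])
    return {benchmark: len(names) for benchmark, names in models.items()}


def inference_engine_spread(rows):
    """Count rows per inference engine; missing values go to 'unknown'."""
    return Counter(row.get("inference_engine") or "unknown" for row in rows)


# Invented sample rows; the real datastore schema may differ.
rows = [
    {"benchmark": "bench_a", "model": "model_x", "inference_engine": "vllm"},
    {"benchmark": "bench_a", "model": "model_y", "inference_engine": "vllm"},
    {"benchmark": "bench_a", "model": "model_x", "inference_engine": None},
    {"benchmark": "bench_b", "model": "model_x", "inference_engine": "sglang"},
]

coverage = model_coverage_by_benchmark(rows)
spread = inference_engine_spread(rows)
```

Counting distinct models with a set avoids inflating coverage when the same model is reported more than once for a benchmark.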

Validation

.venv/bin/python -m pytest tests/test_dataset_statistics.py
.venv/bin/python scripts/plot_dataset_statistics.py --help

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

tommasocerruti

I have just one question about how the score summaries are grouped, then this looks good :)

tommasocerruti · 2026-04-29T18:57:12Z

+        'score_summaries': grouped_summaries(
+            rows,
+            'score',
+            ('benchmark', 'evaluation_name'),


Should these grouped score summaries include a metric identity field too (e.g., metric_id)? The code's benchmark field comes from source_data.dataset_name in the datastore (as you defined in line 60 of this file), and some evaluations in the datastore report multiple metrics under the same dataset_name + evaluation_name pair. For example, in this arc-agi file, the pair ARC Prize evaluations leaderboard JSON + v1_Semi_Private is used for both score (accuracy) and cost_per_task (cost). I believe grouping only by benchmark + evaluation_name would average those different quantities together. Do you agree?
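
The concern can be reproduced with a minimal sketch. The grouped_mean function below is a hypothetical stand-in written for this example, not the PR's grouped_summaries, and the numeric values are invented. It assumes each row carries a metric_id key alongside the score value.

```python
from collections import defaultdict
from statistics import mean


def grouped_mean(rows, value_key, group_keys):
    """Average value_key within each group defined by group_keys (hypothetical)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in group_keys)].append(row[value_key])
    return {key: mean(values) for key, values in groups.items()}


# Two metrics reported under the same benchmark + evaluation_name pair.
# The values are made up for illustration.
rows = [
    {
        "benchmark": "ARC Prize evaluations leaderboard JSON",
        "evaluation_name": "v1_Semi_Private",
        "metric_id": "score",
        "score": 0.5,
    },
    {
        "benchmark": "ARC Prize evaluations leaderboard JSON",
        "evaluation_name": "v1_Semi_Private",
        "metric_id": "cost_per_task",
        "score": 8.0,
    },
]

# Grouping without metric identity blends accuracy and cost into one mean.
mixed = grouped_mean(rows, "score", ("benchmark", "evaluation_name"))

# Adding metric_id to the group key keeps the two quantities apart.
separate = grouped_mean(
    rows, "score", ("benchmark", "evaluation_name", "metric_id")
)
```

Here mixed holds a single group whose mean combines an accuracy with a cost, while separate holds one group per metric.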

yananlong · 2026-04-30

This sounds reasonable. Let me investigate further.

tommasocerruti · 2026-04-30T14:03:36Z

@yananlong are you still working on this, or can I start reviewing it?

yananlong · 2026-04-30T14:08:21Z

Still working. I am making different plots now. More on this later today.

tommasocerruti · 2026-04-30T14:24:01Z

Great, feel free to ping me when you are done.

nelaturuharsha · 2026-05-13T19:41:05Z

Is this still WIP @yananlong ?

yananlong added 6 commits April 29, 2026 13:20

add stats helper

5f76fba

Add dataset statistics plotting script

3dcf199

Add dataset model and runtime coverage plots

5a0858c

Remove generated dataset summary writer

c78b855

Use log scale for inference engine plot

7277f21

Fix top coverage plot ordering

f0e5dcd

yananlong marked this pull request as ready for review April 29, 2026 17:26

Copilot AI review requested due to automatic review settings April 29, 2026 17:26

Copilot started reviewing on behalf of yananlong April 29, 2026 17:27

Copilot AI reviewed Apr 29, 2026

tommasocerruti reviewed Apr 29, 2026

yananlong closed this Apr 30, 2026

yananlong deleted the descriptive-statistics-python-pr branch April 30, 2026 12:08

yananlong restored the descriptive-statistics-python-pr branch April 30, 2026 12:14

yananlong deleted the descriptive-statistics-python-pr branch April 30, 2026 12:14

yananlong restored the descriptive-statistics-python-pr branch April 30, 2026 12:14

yananlong reopened this Apr 30, 2026

yananlong added 7 commits April 30, 2026 09:45

Group score summaries by metric identity

b460da8

minor changes

8c2c578

Merge branch 'descriptive-statistics-python-pr' of https://github.com/yananlong/every_eval_ever into descriptive-statistics-python-pr

e83156f

Revert "Fix top coverage plot ordering"

6f62905

This reverts commit f0e5dcd.

Ignore audit and plan directories

d0b09e1

Remove tracked audit artifacts

ef7ce1f

new 2-panel plots

0426fd2

yananlong and others added 3 commits April 30, 2026 13:41

include computed stats

06d9188

add 3rd figure

cdc7d8c

Delete audit/dataset_statistics.json

c86aa2e

yananlong wants to merge 16 commits into evaleval:main from yananlong:descriptive-statistics-python-pr

4 participants
