Add descriptive dataset statistics plots#126

Open
yananlong wants to merge 16 commits into evaleval:main from yananlong:descriptive-statistics-python-pr

Conversation

@yananlong
Contributor

Summary

  • add descriptive dataset statistics helpers for model coverage by benchmark and inference-engine/runtime spread
  • add PDF plots for model-per-dataset coverage and inference-engine spread
  • cover the new descriptive report fields with focused tests
  • keep the plotting script focused on PDF output only

Validation

  • .venv/bin/python -m pytest tests/test_dataset_statistics.py
  • .venv/bin/python scripts/plot_dataset_statistics.py --help

@yananlong yananlong marked this pull request as ready for review April 29, 2026 17:26
Copilot AI review requested due to automatic review settings April 29, 2026 17:26
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Member

@tommasocerruti tommasocerruti left a comment

I have just one question about how the score summaries are grouped, then this looks good :)

'score_summaries': grouped_summaries(
    rows,
    'score',
    ('benchmark', 'evaluation_name'),
Member

Should these grouped score summaries also include a metric identity field (e.g., metric_id)? The benchmark field comes from source_data.dataset_name in the datastore (as defined on line 60 of this file), and some evaluations in the datastore report multiple metrics under the same dataset_name + evaluation_name pair. For example, in this arc-agi file, the pair ARC Prize evaluations leaderboard JSON + v1_Semi_Private is used for both score (accuracy) and cost_per_task (cost). I believe grouping only by benchmark + evaluation_name would average those different quantities together. Do you agree?
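
To make the concern concrete, here is a minimal sketch of the effect. It does not use the PR's grouped_summaries helper (its full signature is not visible in this thread); mean_by is a hypothetical stand-in, the metric_id key is assumed to exist on each row, and the numeric values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: the numbers are invented, and "metric_id" is an assumed field name.
rows = [
    {"benchmark": "ARC Prize evaluations leaderboard JSON",
     "evaluation_name": "v1_Semi_Private", "metric_id": "score", "score": 0.25},
    {"benchmark": "ARC Prize evaluations leaderboard JSON",
     "evaluation_name": "v1_Semi_Private", "metric_id": "score", "score": 0.75},
    {"benchmark": "ARC Prize evaluations leaderboard JSON",
     "evaluation_name": "v1_Semi_Private", "metric_id": "cost_per_task", "score": 2.0},
    {"benchmark": "ARC Prize evaluations leaderboard JSON",
     "evaluation_name": "v1_Semi_Private", "metric_id": "cost_per_task", "score": 4.0},
]


def mean_by(rows, value_key, group_keys):
    """Average rows[value_key] within each group defined by group_keys."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        groups[key].append(row[value_key])
    return {key: mean(values) for key, values in groups.items()}


# Grouping without the metric mixes accuracy and cost into one average.
mixed = mean_by(rows, "score", ("benchmark", "evaluation_name"))

# Adding the metric identity keeps the two quantities apart.
split = mean_by(rows, "score", ("benchmark", "evaluation_name", "metric_id"))

for key, value in mixed.items():
    print("mixed", key, value)
for key, value in split.items():
    print("split", key, value)
```

With these made-up values, the two-key grouping yields a single mean of 1.75 that is neither an accuracy nor a cost, while the three-key grouping yields 0.5 for score and 3.0 for cost_per_task.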

Contributor Author

This sounds reasonable. Let me investigate further.

@yananlong yananlong closed this Apr 30, 2026
@yananlong yananlong deleted the descriptive-statistics-python-pr branch April 30, 2026 12:08
@yananlong yananlong restored the descriptive-statistics-python-pr branch April 30, 2026 12:14
@yananlong yananlong deleted the descriptive-statistics-python-pr branch April 30, 2026 12:14
@yananlong yananlong restored the descriptive-statistics-python-pr branch April 30, 2026 12:14
@yananlong yananlong reopened this Apr 30, 2026
@tommasocerruti
Member

@yananlong are you still working on this, or can I start reviewing it?

@yananlong
Contributor Author

yananlong commented Apr 30, 2026 via email

@tommasocerruti
Member

Great, feel free to ping me when you are done.

@nelaturuharsha
Collaborator

Is this still WIP, @yananlong?

4 participants