Fix LLM Stats evaluator provenance by tommasocerruti · Pull Request #136 · evaleval/every_eval_ever

tommasocerruti · 2026-05-16T18:18:51Z

Fixes #119 for the llm-stats adapter.

This updates the adapter so evaluator_relationship is inferred from the relationship between the underlying score source/evaluator and the model developer, rather than from LLM Stats as the aggregator.

Paired datastore PR: https://huggingface.co/datasets/evaleval/EEE_datastore/discussions/137

Behavior

self_reported=true or source organization matching the model developer => first_party
self_reported=false or source organization differing from the model developer => third_party
no usable provenance signal => other
LLM Stats remains the aggregator in source_metadata
raw provenance fields are preserved in score_details.details
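
The rules above can be sketched as a small decision function. This is an illustrative sketch, not the adapter's actual code: the function and helper names are hypothetical, and it assumes that an explicit self_reported flag takes precedence over the organization comparison when both are present (the description does not state the precedence).

```python
from typing import Optional


def _normalize(name: Optional[str]) -> Optional[str]:
    """Lowercase and slugify an organization name; empty values become None."""
    if not name:
        return None
    return name.strip().lower().replace(" ", "-").replace("_", "-") or None


def infer_evaluator_relationship(
    self_reported: Optional[bool],
    source_organization: Optional[str],
    model_developer: Optional[str],
) -> str:
    """Hypothetical helper mirroring the Behavior list.

    Assumption: the explicit self_reported flag wins over the
    organization comparison when both signals are available.
    """
    if self_reported is True:
        return "first_party"
    if self_reported is False:
        return "third_party"

    source = _normalize(source_organization)
    developer = _normalize(model_developer)
    if source and developer:
        return "first_party" if source == developer else "third_party"

    # No usable provenance signal.
    return "other"
```

Under these assumptions, a score with self_reported=false from artificial-analysis for a minimax model resolves to third_party, and a score with no flag and no source organization resolves to other.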

Example Data Differences

Third-party example: MiniMax M2.7 on GDPVal-AA

Current datastore:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/main/data/llm-stats/minimax/minimax-m2.7/fa480e85-428c-473d-8c8c-222e74f66155.json

New datastore PR file:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/refs%2Fpr%2F137/data/llm-stats/minimax/minimax-m2.7/7b294459-ebb0-4ff9-b64f-597d94ce2a9d.json

Before:

evaluator_relationship = other
no Artificial Analysis source provenance

After:

evaluator_relationship = third_party
preserves raw_self_reported=false
preserves raw_self_reported_source=https://artificialanalysis.ai/evaluations/gdpval-aa
infers raw_source_organization=artificial-analysis

First-party example: MiniMax M2.7 self-reported scores

Current datastore:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/main/data/llm-stats/minimax/minimax-m2.7/c035d7f9-f489-4f53-a044-f796cee1471b.json

New datastore PR file:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/refs%2Fpr%2F137/data/llm-stats/minimax/minimax-m2.7/46d7bc7e-87e3-4bab-8dc5-8f92603eeb16.json

Before:

evaluator_relationship = first_party
no original MiniMax source URL preserved

After:

still evaluator_relationship = first_party
preserves raw_self_reported=true
preserves raw_self_reported_source=https://www.minimax.io/models/text/m27
infers raw_source_organization=minimax
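
Both examples infer raw_source_organization from the raw_self_reported_source URL. One way this could work is a host lookup, sketched below. The lookup table, its name, and the function name are assumptions made for illustration; only the two URL-to-organization pairs come from the examples above, and the adapter may derive the organization differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup table: the two entries are taken from the examples
# in this description; the table itself is an assumption.
KNOWN_SOURCE_HOSTS = {
    "artificialanalysis.ai": "artificial-analysis",
    "minimax.io": "minimax",
}


def infer_source_organization(url: Optional[str]) -> Optional[str]:
    """Map a source URL to an organization slug, or None if the host is unknown."""
    if not url:
        return None
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if host.startswith("www."):
        host = host[4:]
    return KNOWN_SOURCE_HOSTS.get(host)
```

Returning None for unknown hosts keeps the "no usable provenance signal" case distinct from a confirmed third-party source.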

Validation

uv run pytest tests/test_llm_stats_adapter.py
uv run ruff check utils/llm_stats tests/test_llm_stats_adapter.py
uv run python -m every_eval_ever validate /tmp/eee-llm-stats-pr-output/data/llm-stats

Validation result:

adapter tests passed
ruff passed
regenerated datastore files passed EEE schema validation
HF datastore PR validation passed: 269/269 files
no data/llm-stats/unknown directory
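
The last two checks (file count and absence of an unknown directory) can be reproduced with a short script. This is a minimal sketch using only the standard library; the function name is hypothetical and it is not part of the every_eval_ever CLI.

```python
from pathlib import Path
from typing import Union


def summarize_output(root: Union[str, Path]) -> dict:
    """Count generated JSON files and check for an 'unknown' developer directory."""
    root = Path(root)
    json_files = sorted(root.rglob("*.json"))
    return {
        "file_count": len(json_files),
        "has_unknown_dir": (root / "unknown").is_dir(),
    }
```

Pointing it at the output directory used in the validate command (/tmp/eee-llm-stats-pr-output/data/llm-stats) would give a file count to compare against the number reported above, with has_unknown_dir expected to be False.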

Fix LLM Stats evaluator provenance

52030cb

tommasocerruti wants to merge 1 commit into evaleval:main from tommasocerruti:llm-stats-provenance-fix