
Fix LLM Stats evaluator provenance #136

Open
tommasocerruti wants to merge 1 commit into evaleval:main from tommasocerruti:llm-stats-provenance-fix

tommasocerruti commented May 16, 2026

Fixes #119 for the llm-stats adapter.

This updates the adapter so evaluator_relationship is inferred from the relationship between the underlying score source/evaluator and the model developer, rather than from LLM Stats as the aggregator.

Paired datastore PR: https://huggingface.co/datasets/evaleval/EEE_datastore/discussions/137

Behavior

  • self_reported=true or source organization matching the model developer => first_party
  • self_reported=false or source organization differing from the model developer => third_party
  • no usable provenance signal => other
  • LLM Stats remains the aggregator in source_metadata
  • raw provenance fields are preserved in score_details.details
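
The rules above could be sketched roughly as follows. This is an illustration only, not the adapter's actual code: the function name `infer_evaluator_relationship`, its parameters, and the precedence (an explicit `self_reported` flag is checked before the organization comparison) are assumptions made for the example.

```python
from typing import Optional


def infer_evaluator_relationship(
    self_reported: Optional[bool],
    source_organization: Optional[str],
    model_developer: Optional[str],
) -> str:
    """Hypothetical helper mirroring the behavior rules listed above."""
    # Assumption: an explicit self_reported flag wins over the org comparison.
    if self_reported is True:
        return "first_party"
    if self_reported is False:
        return "third_party"
    # No flag: fall back to comparing the source organization with the developer.
    if source_organization and model_developer:
        same = source_organization.strip().lower() == model_developer.strip().lower()
        return "first_party" if same else "third_party"
    # No usable provenance signal.
    return "other"
```

Under these assumptions, a score with `self_reported=False` sourced from `artificial-analysis` for a `minimax` model maps to `third_party`, and a record with no flag and no source organization maps to `other`.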

Example Data Differences

Third-party example: MiniMax M2.7 on GDPVal-AA

Current datastore:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/main/data/llm-stats/minimax/minimax-m2.7/fa480e85-428c-473d-8c8c-222e74f66155.json

New datastore PR file:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/refs%2Fpr%2F137/data/llm-stats/minimax/minimax-m2.7/7b294459-ebb0-4ff9-b64f-597d94ce2a9d.json

Before:

  • evaluator_relationship = other
  • no Artificial Analysis source provenance

After:

  • evaluator_relationship = third_party
  • preserves raw_self_reported=false
  • preserves raw_self_reported_source=https://artificialanalysis.ai/evaluations/gdpval-aa
  • infers raw_source_organization=artificial-analysis

First-party example: MiniMax M2.7 self-reported scores

Current datastore:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/main/data/llm-stats/minimax/minimax-m2.7/c035d7f9-f489-4f53-a044-f796cee1471b.json

New datastore PR file:
https://huggingface.co/datasets/evaleval/EEE_datastore/raw/refs%2Fpr%2F137/data/llm-stats/minimax/minimax-m2.7/46d7bc7e-87e3-4bab-8dc5-8f92603eeb16.json

Before:

  • evaluator_relationship = first_party
  • no original MiniMax source URL preserved

After:

  • still evaluator_relationship = first_party
  • preserves raw_self_reported=true
  • preserves raw_self_reported_source=https://www.minimax.io/models/text/m27
  • infers raw_source_organization=minimax
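
In both examples, `raw_source_organization` is inferred from the source URL. One way such an inference might work is a host lookup, sketched below. The table `KNOWN_SOURCE_HOSTS` and the function `infer_source_organization` are hypothetical names, and the mapping contains only the two hosts that appear in the examples above; the adapter may use a different mechanism.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup table: only the two hosts seen in this PR's examples.
KNOWN_SOURCE_HOSTS = {
    "artificialanalysis.ai": "artificial-analysis",
    "minimax.io": "minimax",
}


def infer_source_organization(source_url: Optional[str]) -> Optional[str]:
    """Map a source URL to an organization slug, or None if unrecognized."""
    if not source_url:
        return None
    host = (urlparse(source_url).hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if host.startswith("www."):
        host = host[4:]
    return KNOWN_SOURCE_HOSTS.get(host)
```

Returning `None` for unrecognized hosts keeps the "no usable provenance signal" case explicit instead of guessing an organization.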

Validation

uv run pytest tests/test_llm_stats_adapter.py
uv run ruff check utils/llm_stats tests/test_llm_stats_adapter.py
uv run python -m every_eval_ever validate /tmp/eee-llm-stats-pr-output/data/llm-stats

Validation result:

  • adapter tests passed
  • ruff passed
  • regenerated datastore files passed EEE schema validation
  • HF datastore PR validation passed: 269/269 files
  • no data/llm-stats/unknown directory
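
The last two checks (file count and absence of an `unknown` directory) can be reproduced locally with a small script. This is a sketch assuming the regenerated output is a directory tree of `.json` files; `summarize_output` is a hypothetical helper, not part of `every_eval_ever`.

```python
from pathlib import Path


def summarize_output(root: Path) -> dict:
    """Count regenerated JSON files and flag a top-level 'unknown' directory."""
    files = sorted(root.rglob("*.json"))
    return {
        "json_files": len(files),
        "has_unknown_dir": (root / "unknown").is_dir(),
    }
```

For example, `summarize_output(Path("/tmp/eee-llm-stats-pr-output/data/llm-stats"))` would report the number of regenerated files and whether an `unknown` developer directory was produced.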

Development

Successfully merging this pull request may close these issues.

Inconsistency in evaluator_relationship