[Feature] Add text-to-image modality support by felifri · Pull Request #137 · evaleval/every_eval_ever

felifri · 2026-05-18T10:18:05Z

Summary

Three minimal, additive changes let the schema represent text-to-image generation evaluations (FLUX, SDXL, Imagen, ...) alongside LLMs. Everything else (sampler args, image dimensions, hashes, human-rater pools) flows through the existing additional_details escape hatches, which keeps the surface area small.

- modality: optional enum (text | text_to_image) on each evaluation_results[] entry and on each instance record. Absent means text, for backwards compatibility.
- output.media: MediaRef[] holds the generated artifacts on the instance record and is required when modality == "text_to_image". MediaRef is intentionally minimal: required {media_type, uri} plus an additional_details bag for sha256, mime_type, width/height, seed, index, etc. media_type is enum [image, video, audio], so the same shape extends to future modalities.
- evaluation.is_correct is widened to boolean | null and is set to null when the metric is continuous (FID, CLIPScore, ImageReward, ...).

T2I uses interaction_type: "single_turn" — modality is the orthogonal axis, which keeps interaction_type logic untouched and lets future image-edit/video records slot in via a single enum extension.

Schema version fields are intentionally left at 0.2.2/instance_level_eval_0.2.2 — bumping is a maintainer call, not part of this PR. The Pydantic models are regenerated via the documented datamodel-codegen pipeline (no hand edits to generated files). One new validate_modality_consistency model validator is added through post_codegen.py so the "T2I requires output.media" constraint is enforced at validation time.
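The constraint enforced by validate_modality_consistency can be illustrated with a plain-Python sketch over dict records. This is not the generated Pydantic validator: the function name check_modality_consistency, the error messages, and the per-MediaRef key check are assumptions made for illustration only.

```python
from typing import Any

ALLOWED_MODALITIES = {"text", "text_to_image"}


def check_modality_consistency(record: dict[str, Any]) -> None:
    """Raise ValueError if a record breaks the T2I rule described above."""
    # An absent modality is treated as "text" for backwards compatibility.
    modality = record.get("modality", "text")
    if modality not in ALLOWED_MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    if modality != "text_to_image":
        return
    # T2I records need a non-null output with a non-empty media list.
    output = record.get("output")
    media = output.get("media") if isinstance(output, dict) else None
    if not media:
        raise ValueError("modality 'text_to_image' requires non-empty output.media")
    for ref in media:
        # MediaRef requires media_type and uri; everything else is optional.
        missing = {"media_type", "uri"} - set(ref)
        if missing:
            raise ValueError(f"MediaRef missing required keys: {sorted(missing)}")
```

A record with no modality field passes unchanged, while {"modality": "text_to_image"} with no output raises ValueError, mirroring the test cases listed in the test plan.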

Existing converters and their tests are completely untouched.

What it looks like

A worked example is included at tests/data/t2i/geneval_sdxl_example.{json,jsonl} — minimal GenEval/SDXL fixture showing modality: text_to_image, T2I sampler args in generation_config.additional_details, output.media[] with MediaRefs, and is_correct: null.

Test plan

- All 134 existing tests still pass
- 5 new T2I tests cover: fixture validates, missing output.media fails with the modality validator's error, is_correct: null passes, unknown modality (image_edit) is rejected, and records without a modality field continue to validate as before
- python -m every_eval_ever validate tests/data/t2i/ exits 0
- Both schemas pass jsonschema.Draft7Validator.check_schema
- Pydantic types regenerated cleanly via the documented uv run datamodel-codegen ... + uv run python post_codegen.py pipeline

Three minimal, additive changes let the schema represent text-to-image generation evaluations (FLUX, SDXL, Imagen, ...) alongside LLMs. Everything else (sampler args, image dimensions, hashes, rater pools) flows through the existing `additional_details` escape hatches, which keeps the surface area small. Version fields are left at 0.2.2; bumping is a maintainer decision and not part of this PR.

Aggregate schema (every_eval_ever/schemas/eval.schema.json):
- New optional `modality` enum (text | text_to_image) on each evaluation_results[] entry. Absent = text.

Instance schema (every_eval_ever/schemas/instance_level_eval.schema.json):
- New optional `modality` enum (same values).
- New `output.media: MediaRef[]` for generated artifacts. MediaRef is intentionally minimal: required {media_type, uri} plus an `additional_details` bag for sha256, mime_type, width/height, seed, index, etc. (`media_type` is enum [image, video, audio] so the same shape extends to future modalities without re-versioning.)
- `evaluation.is_correct` widened to boolean|null. T2I records set it to null when the metric is continuous (FID, CLIPScore, ImageReward).

T2I uses `interaction_type: single_turn`. Modality is the orthogonal axis, which is the key reason this extends cleanly to image-edit / video later (adding a modality enum value, not touching interaction logic).

Sampler args (num_inference_steps, guidance_scale, width/height, scheduler, seed) go in `generation_config.additional_details` as string key-values; human-rater pools go in `metric_config.additional_details` until a future PR adds first-class structure for them.

Pydantic types regenerated via datamodel-codegen (the documented pipeline).

post_codegen.py:
- New validate_modality_consistency model_validator on InstanceLevelEvaluationLog: modality == text_to_image requires non-null output with a non-empty media list.
- Skip-check fix: the previous blanket "if 'post_codegen.py' in content" skip prevented a second validator from being appended to the same file. Now scopes the check to the specific validator method name so multiple validators can coexist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
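The skip-check fix can be pictured with the following minimal sketch. The actual post_codegen.py is not shown on this page, so the helper name append_validator, its signature, and the way source text is appended are assumptions; only the idea of keying the check on the validator's method name comes from the commit message.

```python
def append_validator(content: str, method_name: str, validator_src: str) -> str:
    """Append validator_src unless a method named method_name already exists."""
    # Scope the idempotency check to this validator's own method name, so a
    # second, differently named validator can still be added to a file that
    # already carries the "added by post_codegen.py" marker.
    if f"def {method_name}(" in content:
        return content
    return content.rstrip("\n") + "\n\n" + validator_src.rstrip("\n") + "\n"
```

With a blanket check on the string "post_codegen.py", the second call for a new validator would be skipped as soon as the first validator (and its marker comment) had been written; checking the method name keeps reruns idempotent without blocking new validators.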

tests/data/t2i/:
- geneval_sdxl_example.json: minimal GenEval aggregate for SDXL Base 1.0 showing modality=text_to_image, a continuous VQA-style score (geneval.overall, metric_kind=vqa_score), and T2I sampler args (num_inference_steps, guidance_scale, width/height, scheduler, seed) living in generation_config.additional_details as string key-values.
- geneval_sdxl_example_samples.jsonl: one per-sample line showing output.media[] with two MediaRefs (additional_details holding sha256, width/height, seed, index), empty answer_attribution, is_correct=null.

tests/test_validate.py: new TestT2I class with 5 cases:
- geneval fixture passes
- modality=text_to_image without output.media fails
- modality=text_to_image with null is_correct passes
- unknown modality value (e.g. image_edit) is rejected
- existing records without a modality field still validate

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a short subsection (12 lines) after Agentic Evaluations explaining the three minimal T2I additions (modality, output.media, is_correct widening), the convention of using additional_details for sampler args and rater pools, and pointing to the tests/data/t2i/ fixture for a worked example. No schema-version mentions are altered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Erotemic · 2026-05-20T16:22:07Z

FYI: the link is broken.

Erotemic

I think we need a bit more discussion about what to do here. Not exactly sure what changes should be scoped here. I could also be overcomplicating things, so push back if I am.

Erotemic · 2026-05-20T16:42:57Z

            if self.max_score is None:
                raise ValueError("score_type 'continuous' requires max_score")
        return self
+""",


Probably not an issue with this PR, but Python code in a string can be a smell. I don't quite understand the purpose of this file yet, but we may want to revisit the design that currently requires this.

Erotemic · 2026-05-20T16:44:36Z

+# ===================================================================
+
+
+T2I_FIXTURE_DIR = Path(__file__).parent / 'data' / 't2i'


Not a blocker, but __file__ is almost never a robust way to handle finding resources. We should think about factoring out an EEE data repo that properly packages and provides access to the data needed for tests.
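One common alternative to __file__-relative paths is the standard library's importlib.resources, which locates files shipped inside an installed package. The sketch below assumes the fixtures would live in an importable package; the helper name fixture_path and any package name passed to it (for example a hypothetical eee_test_data) are illustrative, not part of this repository.

```python
from importlib.resources import files


def fixture_path(package: str, *parts: str):
    """Return a Traversable for a resource shipped inside an installed package."""
    # files() resolves the package wherever it is installed (directory, wheel,
    # or zip), so callers do not depend on the test module's own location.
    resource = files(package)
    for part in parts:
        resource = resource / part
    return resource
```

Under these assumptions a test would call something like fixture_path("eee_test_data", "t2i", "geneval_sdxl_example.json").read_text() instead of building a path from Path(__file__).parent.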

Erotemic · 2026-05-20T16:47:04Z

    # --- validators (added by post_codegen.py) ---

-    @model_validator(mode='after')
+    @model_validator(mode="after")


We should just use ruff format and settle on a quote style (I like single quotes) to avoid diffs like these. Not a blocker here, but something that should happen soon as the repo grows.

Erotemic · 2026-05-20T16:53:44Z

                    "description": "Reasoning traces of the model if applicable (e.g. chain-of-thought tokens)",
                    "items": { "type": "string" }
+                },
+                "media": {


I think this doesn't correspond with the pydantic validator. I'm not sure if it should. If it did it would get a bit wordy, i.e.

    {
      "if": {
        "required": ["modality"],
        "properties": { "modality": { "const": "text_to_image" } }
      },
      "then": {
        "required": ["output"],
        "properties": {
          "output": {
            "type": "object",
            "required": ["media"],
            "properties": {
              "media": { "type": "array", "minItems": 1 }
            }
          }
        }
      }
    }

The larger issue is having two sources of truth for the schema. Does it make sense to go all in on pydantic and then have it generate the jsonschema?

Erotemic · 2026-05-20T16:58:20Z

+                },
+                "uri": {
+                    "type": "string",
+                    "description": "Location of the artifact: 'file://...', 'https://...', 'hf://...', 's3://...', or 'data:...;base64,...' for inline."


I think the data:... option is a BAD idea here. This overscopes what this is. This should always be a reference to the data, not the data itself. If you want to support embedding the data itself, it should be a separate optional field.

I think one URI is also a bad idea. There should be a list of suggested ways to access the data. Otherwise you run into a dead URL issue. It happens too often, and it limits the reproducibility value. URLs rot, and they can be changed to point to something else (which is where the hash is important). I would also suggest thinking about allowing distributed content-addressed references like IPFS CIDs or BitTorrent magnet URIs here. These avoid the problem where the content at an address can change, but they do not address the link-rot issue (which is always going to be a fundamental limitation of any reference-based scheme).
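As a rough sketch of what this suggestion could look like, the following uses a list of candidate locations plus a recorded SHA-256, and accepts the first location whose bytes match the hash. The names MediaRefSketch, locations, and resolve are hypothetical and not part of the PR's schema; fetch is a caller-supplied function that returns bytes, or None for a dead link.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MediaRefSketch:
    media_type: str  # "image", "video", or "audio"
    sha256: str  # hex digest of the artifact bytes
    locations: list[str] = field(default_factory=list)  # candidate URIs, tried in order


def resolve(ref: MediaRefSketch, fetch: Callable[[str], Optional[bytes]]) -> bytes:
    """Return the first fetched payload whose SHA-256 matches ref.sha256."""
    for uri in ref.locations:
        data = fetch(uri)
        if data is None:
            continue  # dead link: try the next location
        if hashlib.sha256(data).hexdigest() == ref.sha256:
            return data
        # Reachable but changed content is skipped, just like a dead link.
    raise LookupError("no location returned content matching the recorded hash")
```

The hash check is what guards against a URL that still resolves but now serves different content; the list of locations only reduces, and cannot eliminate, the link-rot risk.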

Erotemic · 2026-05-20T16:59:35Z

    score: float = Field(..., description='Instance-level score')
-    is_correct: bool = Field(
-        ..., description='Whether the final answer is correct'
+    is_correct: bool | None = Field(


Is there a better way to express this that generalizes better? I don't have an immediate idea, but this seems like the start of a boolean explosion to me.

Erotemic · 2026-05-20T16:59:55Z

+        extra='forbid',
+    )
+    media_type: MediaType
+    uri: str = Field(


Similar URI comment

felifri and others added 3 commits May 18, 2026 10:15

nelaturuharsha assigned nelaturuharsha and unassigned nelaturuharsha May 18, 2026

nelaturuharsha requested review from Erotemic and damian1996 May 18, 2026 17:54

Erotemic reviewed May 20, 2026

felifri wants to merge 3 commits into evaleval:main from felifri:feature/t2i-schema-extension

3 participants
