[Feature] Add text-to-image modality support #137
Three minimal, additive changes let the schema represent text-to-image
generation evaluations (FLUX, SDXL, Imagen, ...) alongside LLMs. Everything
else (sampler args, image dimensions, hashes, rater pools) flows through
the existing `additional_details` escape hatches, which keeps the surface
area small. Version fields are left at 0.2.2 — bumping is a maintainer
decision and not part of this PR.
Aggregate schema (every_eval_ever/schemas/eval.schema.json):
- New optional `modality` enum (text | text_to_image) on each
evaluation_results[] entry. Absent = text.
Instance schema (every_eval_ever/schemas/instance_level_eval.schema.json):
- New optional `modality` enum (same values).
- New `output.media: MediaRef[]` for generated artifacts. MediaRef is
intentionally minimal: required {media_type, uri} plus an
`additional_details` bag for sha256, mime_type, width/height, seed,
index, etc. (`media_type` is enum [image, video, audio] so the same
shape extends to future modalities without re-versioning.)
- `evaluation.is_correct` widened to boolean|null. T2I records set it
to null when the metric is continuous (FID, CLIPScore, ImageReward).
T2I uses `interaction_type: single_turn` — modality is the orthogonal
axis, which is the key reason this extends cleanly to image-edit / video
later (adding a modality enum value, not touching interaction logic).
Sampler args (num_inference_steps, guidance_scale, width/height, scheduler,
seed) go in `generation_config.additional_details` as string key-values;
human-rater pools go in `metric_config.additional_details` until a future
PR adds first-class structure for them.
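A minimal sketch of the string key-value convention described above. The helper name `stringify_details` and the sampler values are illustrative assumptions, not part of this PR:

```python
from typing import Any, Mapping


def stringify_details(details: Mapping[str, Any]) -> dict[str, str]:
    """Render every value as a string so it fits a string-only key-value bag."""
    return {key: str(value) for key, value in details.items()}


# Placeholder sampler settings; these are not taken from a real run.
sampler_args = {
    "num_inference_steps": 30,
    "guidance_scale": 7.5,
    "width": 1024,
    "height": 1024,
    "scheduler": "euler",
    "seed": 0,
}

generation_config = {"additional_details": stringify_details(sampler_args)}
```

Consumers that need numeric values would have to parse them back (for example `float(details["guidance_scale"])`), which is the trade-off of keeping the bag string-only.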
Pydantic types regenerated via datamodel-codegen (the documented pipeline).
post_codegen.py:
- New validate_modality_consistency model_validator on
InstanceLevelEvaluationLog: modality == text_to_image requires
non-null output with a non-empty media list.
- Skip-check fix: the previous blanket "if 'post_codegen.py' in
content" skip prevented a second validator from being appended to
the same file. Now scopes the check to the specific validator method
name so multiple validators can coexist.
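The skip-check fix can be sketched roughly as follows. This is an assumption about the shape of the logic, not the actual `post_codegen.py` code; `append_validator` and `validate_existing` are hypothetical names:

```python
MARKER = "# --- validators (added by post_codegen.py) ---"


def append_validator(content: str, method_name: str, validator_source: str) -> str:
    """Append validator_source unless a method named method_name already exists."""
    if f"def {method_name}(" in content:
        return content  # this specific validator was already injected
    return content.rstrip("\n") + "\n\n" + validator_source.rstrip("\n") + "\n"


# Generated file that already carries one injected validator (hypothetical name).
generated = (
    "class InstanceLevelEvaluationLog:\n"
    f"    {MARKER}\n"
    "    def validate_existing(self):\n"
    "        return self\n"
)
new_validator = (
    "    def validate_modality_consistency(self):\n"
    "        return self\n"
)

# A blanket "'post_codegen.py' in content" check is already true here,
# so it would have skipped the second validator entirely.
blanket_check_would_skip = "post_codegen.py" in generated

once = append_validator(generated, "validate_modality_consistency", new_validator)
twice = append_validator(once, "validate_modality_consistency", new_validator)
```

Scoping the check to the method name keeps the step idempotent per validator: running it again leaves the file unchanged, while a validator with a different name is still appended.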
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tests/data/t2i/:
- geneval_sdxl_example.json: minimal GenEval aggregate for SDXL Base 1.0
showing modality=text_to_image, a continuous VQA-style score
(geneval.overall, metric_kind=vqa_score), and T2I sampler args
(num_inference_steps, guidance_scale, width/height, scheduler, seed)
living in generation_config.additional_details as string key-values.
- geneval_sdxl_example_samples.jsonl: one per-sample line showing
output.media[] with two MediaRefs (additional_details holding sha256,
width/height, seed, index), empty answer_attribution, is_correct=null.
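For orientation, a per-sample line of this kind might be built as below. Only the field names mentioned in this PR are used; their exact nesting, the string-valued `additional_details`, and every value (URIs, digest placeholder, score) are assumptions for illustration and are not copied from the fixture:

```python
import json

# Two MediaRefs for one prompt; URIs and digests are placeholders.
media = [
    {
        "media_type": "image",
        "uri": f"file:///path/to/sample_{index}.png",
        "additional_details": {
            "sha256": "<sha256 of the file>",
            "width": "1024",
            "height": "1024",
            "seed": str(seed),
            "index": str(index),
        },
    }
    for index, seed in enumerate([0, 1])
]

record = {
    "modality": "text_to_image",
    "interaction_type": "single_turn",
    "output": {"media": media},
    "answer_attribution": [],
    # Continuous metric: the score carries the result, is_correct stays null.
    "evaluation": {"score": 0.5, "is_correct": None},
}

line = json.dumps(record)  # one JSONL line
```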
tests/test_validate.py — new TestT2I class with 5 cases:
- geneval fixture passes
- modality=text_to_image without output.media fails
- modality=text_to_image with null is_correct passes
- unknown modality value (e.g. image_edit) is rejected
- existing records without a modality field still validate
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a short subsection (12 lines) after Agentic Evaluations explaining the three minimal T2I additions (modality, output.media, is_correct widening), the convention of using additional_details for sampler args and rater pools, and pointing to the tests/data/t2i/ fixture for a worked example. No schema-version mentions are altered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FYI: the link is broken.
Erotemic
left a comment
I think we need a bit more discussion about what to do here. Not exactly sure what changes should be scoped here. I could also be overcomplicating things, so push back if I am.
if self.max_score is None:
    raise ValueError("score_type 'continuous' requires max_score")
return self
""",
Probably not an issue with this PR, but Python code in a string can be a smell. I don't quite understand the purpose of this file yet, but we may want to revisit the design that requires this ATM.
# ===================================================================
T2I_FIXTURE_DIR = Path(__file__).parent / 'data' / 't2i'
Not a blocker, but __file__ is almost never a robust way to find resources. We should think about factoring out an EEE data repo that properly packages and provides access to the data needed for tests.
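One way this could look, assuming the test data were shipped inside an installable package: `importlib.resources.files` resolves resources through the import system rather than through `__file__`. The helper `fixture_dir` and the package name `eee_test_data` are hypothetical; the demo uses the stdlib `json` package only so the snippet runs anywhere:

```python
from importlib.resources import files


def fixture_dir(package: str, *parts: str):
    """Return a Traversable for a resource shipped inside `package`."""
    resource = files(package)
    for part in parts:
        resource = resource / part
    return resource


# With a packaged data repo this might read (hypothetical package name):
#   T2I_FIXTURE_DIR = fixture_dir("eee_test_data", "t2i")

# Demo against a package that is always installed.
demo = fixture_dir("json", "decoder.py")
demo_exists = demo.is_file()
```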
  # --- validators (added by post_codegen.py) ---
- @model_validator(mode='after')
+ @model_validator(mode="after")
We should just use ruff format and settle on a quote style (I like single quotes) to avoid diffs like these. Not a blocker here, but something that should happen soon as the repo grows.
  "description": "Reasoning traces of the model if applicable (e.g. chain-of-thought tokens)",
  "items": { "type": "string" }
},
"media": {
I think this doesn't correspond with the pydantic validator. I'm not sure if it should. If it did it would get a bit wordy, i.e.
{
  "if": {
    "required": ["modality"],
    "properties": {
      "modality": { "const": "text_to_image" }
    }
  },
  "then": {
    "required": ["output"],
    "properties": {
      "output": {
        "type": "object",
        "required": ["media"],
        "properties": {
          "media": {
            "type": "array",
            "minItems": 1
          }
        }
      }
    }
  }
}
The larger issue is having two sources of truth for the schema. Does it make sense to go all in on pydantic and then have it generate the jsonschema?
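As a quick check that the conditional above would mirror the Pydantic validator, here is a small sketch using the `jsonschema` package. It validates only the `if`/`then` fragment in isolation, not the full instance schema, and the sample records are made up:

```python
from jsonschema import Draft7Validator

conditional = {
    "if": {
        "required": ["modality"],
        "properties": {"modality": {"const": "text_to_image"}},
    },
    "then": {
        "required": ["output"],
        "properties": {
            "output": {
                "type": "object",
                "required": ["media"],
                "properties": {"media": {"type": "array", "minItems": 1}},
            }
        },
    },
}

Draft7Validator.check_schema(conditional)  # raises if the fragment is malformed
validator = Draft7Validator(conditional)

one_image = {"media_type": "image", "uri": "file:///path/to/sample.png"}

results = {
    "no modality": validator.is_valid({}),
    "text without output": validator.is_valid({"modality": "text"}),
    "t2i without output": validator.is_valid({"modality": "text_to_image"}),
    "t2i with empty media": validator.is_valid(
        {"modality": "text_to_image", "output": {"media": []}}
    ),
    "t2i with one image": validator.is_valid(
        {"modality": "text_to_image", "output": {"media": [one_image]}}
    ),
}
```

The `"required": ["modality"]` inside `if` matters: without it, a record with no `modality` would satisfy the `if` vacuously and be forced through `then`.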
},
"uri": {
  "type": "string",
  "description": "Location of the artifact: 'file://...', 'https://...', 'hf://...', 's3://...', or 'data:...;base64,...' for inline."
I think the data:... option is a BAD idea here. This overscopes what this is. This should always be a reference to the data, not the data itself. If you want to support embedding the data itself, it should be a separate optional field.
I think one URI is also a bad idea. There should be a list of suggested ways to access the data. Otherwise you run into a dead URL issue. It happens too often, and it limits the reproducibility value. URLs rot, and they can be changed to point to something else (which is where the hash is important). I would also suggest thinking about allowing distributed content-addressed references like IPFS CIDs or BitTorrent magnet URIs here, which avoid the problem where the content at an address can change, but do not address the link-rot issue (which is always going to be a fundamental limitation of any reference-based scheme).
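To make the suggestion concrete, a rough sketch of how a consumer could use a list of candidate locations together with a recorded hash. Nothing here exists in the PR: `fetch_verified`, the injected `fetch` callable, and the example locations are all hypothetical:

```python
import hashlib
from typing import Callable, Iterable, Optional


def fetch_verified(
    locations: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Try each location in order; return the first payload whose hash matches."""
    for location in locations:
        try:
            payload = fetch(location)
        except OSError:
            continue  # unreachable location: treat as link rot and move on
        if payload is None:
            continue
        if hashlib.sha256(payload).hexdigest() == expected_sha256:
            return payload
    return None


# In-memory stand-in for real fetchers (HTTP, S3, IPFS, ...).
image_bytes = b"not a real image"
digest = hashlib.sha256(image_bytes).hexdigest()
store = {
    "https://example.com/moved.png": b"different content",  # URL was repointed
    "s3://bucket/sample.png": image_bytes,
}
locations = [
    "https://example.com/gone.png",  # dead link
    "https://example.com/moved.png",
    "s3://bucket/sample.png",
]

result = fetch_verified(locations, digest, store.get)
```

The hash is what makes fallbacks safe: a repointed URL is skipped instead of silently returning the wrong artifact. If every location is dead, the function returns `None`, which is the residual link-rot case.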
  score: float = Field(..., description='Instance-level score')
- is_correct: bool = Field(
-     ..., description='Whether the final answer is correct'
+ is_correct: bool | None = Field(
Is there a better way to express this that generalizes better? I don't have an immediate idea, but this seems like the start of a boolean explosion to me.
    extra='forbid',
)
media_type: MediaType
uri: str = Field(
Summary
Three minimal, additive changes let the schema represent text-to-image generation evaluations (FLUX, SDXL, Imagen, ...) alongside LLMs. Everything else (sampler args, image dimensions, hashes, human-rater pools) flows through the existing `additional_details` escape hatches, which keeps the surface area small.
- `modality`: optional enum (`text` | `text_to_image`) on each `evaluation_results[]` entry and on each instance record. Absent means `text` for backwards compatibility.
- `output.media: MediaRef[]`: generated artifacts on the instance record. `MediaRef` is intentionally minimal: required `{media_type, uri}` plus an `additional_details` bag for sha256, mime_type, width/height, seed, index, etc. Required when `modality == "text_to_image"`. `media_type` is enum `[image, video, audio]` so the same shape extends to future modalities.
- `evaluation.is_correct` widened to `boolean | null`: set to `null` when the metric is continuous (FID, CLIPScore, ImageReward, ...).
T2I uses `interaction_type: "single_turn"`; modality is the orthogonal axis, which keeps `interaction_type` logic untouched and lets future image-edit/video records slot in via a single enum extension.
Schema `version` fields are intentionally left at `0.2.2` / `instance_level_eval_0.2.2`; bumping is a maintainer call, not part of this PR. The Pydantic models are regenerated via the documented `datamodel-codegen` pipeline (no hand edits to generated files). One new `validate_modality_consistency` model validator is added through `post_codegen.py` so the "T2I requires `output.media`" constraint is enforced at validation time.
Existing converters and their tests are completely untouched.
What it looks like
A worked example is included at `tests/data/t2i/geneval_sdxl_example.{json,jsonl}`: a minimal GenEval/SDXL fixture showing `modality: text_to_image`, T2I sampler args in `generation_config.additional_details`, `output.media[]` with `MediaRef`s, and `is_correct: null`.
Test plan
- `output.media` fails with the modality validator's error, `is_correct: null` passes, unknown modality (`image_edit`) is rejected, and records without a `modality` field continue to validate as before
- `python -m every_eval_ever validate tests/data/t2i/` exits 0
- `jsonschema.Draft7Validator.check_schema`
- `uv run datamodel-codegen ...` + `uv run python post_codegen.py` pipeline
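For readers who want to see the constraint in isolation, here is a trimmed-down sketch of the modality validator using Pydantic v2. `InstanceRecord` and `Output` are simplified stand-ins for the generated models (the real `InstanceLevelEvaluationLog` has many more fields), and the error message and `validates` helper are illustrative:

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError, model_validator


class Modality(str, Enum):
    text = "text"
    text_to_image = "text_to_image"


class MediaType(str, Enum):
    image = "image"
    video = "video"
    audio = "audio"


class MediaRef(BaseModel):
    model_config = ConfigDict(extra="forbid")
    media_type: MediaType
    uri: str
    additional_details: Optional[dict[str, str]] = None


class Output(BaseModel):
    media: Optional[list[MediaRef]] = None


class InstanceRecord(BaseModel):  # stand-in for InstanceLevelEvaluationLog
    modality: Optional[Modality] = None  # absent is treated as text
    output: Optional[Output] = None

    @model_validator(mode="after")
    def validate_modality_consistency(self) -> "InstanceRecord":
        if self.modality == Modality.text_to_image:
            if self.output is None or not self.output.media:
                raise ValueError(
                    "modality 'text_to_image' requires a non-empty output.media"
                )
        return self


def validates(payload: dict) -> bool:
    """Return True when the payload passes the sketch model."""
    try:
        InstanceRecord.model_validate(payload)
    except ValidationError:
        return False
    return True


image = {"media_type": "image", "uri": "file:///path/to/sample.png"}
t2i_ok = validates({"modality": "text_to_image", "output": {"media": [image]}})
t2i_missing_media = validates({"modality": "text_to_image"})
legacy_record = validates({})
```

The three results correspond to the fixture-passes, missing-media, and no-modality cases from the test plan; an unknown value such as `image_edit` is rejected by the enum itself, before the validator runs.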