[Feature] Add text-to-image modality support#137

Open
felifri wants to merge 3 commits into evaleval:main from felifri:feature/t2i-schema-extension

Conversation

@felifri felifri commented May 18, 2026

Summary

Three minimal, additive changes let the schema represent text-to-image generation evaluations (FLUX, SDXL, Imagen, ...) alongside LLMs. Everything else (sampler args, image dimensions, hashes, human-rater pools) flows through the existing additional_details escape hatches, which keeps the surface area small.

  • modality — optional enum (text | text_to_image) on each evaluation_results[] entry and on each instance record. Absent means text for backwards compatibility.
  • output.media: MediaRef[] — generated artifacts on the instance record. MediaRef is intentionally minimal: required {media_type, uri} plus an additional_details bag for sha256, mime_type, width/height, seed, index, etc. Required when modality == "text_to_image". media_type is enum [image, video, audio] so the same shape extends to future modalities.
  • evaluation.is_correct widened to boolean | null — set to null when the metric is continuous (FID, CLIPScore, ImageReward, ...).

T2I uses interaction_type: "single_turn" — modality is the orthogonal axis, which keeps interaction_type logic untouched and lets future image-edit/video records slot in via a single enum extension.
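As a rough illustration of the three additions, a T2I instance record might carry the fields below. This is only a fragment: the field names `modality`, `output.media`, `media_type`, `uri`, `additional_details`, `evaluation.is_correct`, and `interaction_type` come from this description, while the file path, the score, the detail values, and the choice to store detail values as strings are made-up assumptions. The fixture under tests/data/t2i/ is the authoritative example.

```python
import json

# Fragment of a hypothetical T2I instance record. Fields this PR does not
# discuss (ids, schema version, input, ...) are omitted, so this fragment
# is not expected to validate on its own.
t2i_record_fragment = {
    "interaction_type": "single_turn",  # unchanged by this PR
    "modality": "text_to_image",        # an absent field would mean "text"
    "output": {
        "media": [
            {
                "media_type": "image",  # enum: image | video | audio
                "uri": "file://outputs/sample_0000_0.png",  # made-up path
                "additional_details": {  # free-form bag; string values assumed
                    "width": "1024",
                    "height": "1024",
                    "seed": "42",
                    "index": "0",
                },
            }
        ]
    },
    "evaluation": {
        "score": 0.71,       # made-up continuous score
        "is_correct": None,  # serialized as JSON null
    },
}

serialized = json.dumps(t2i_record_fragment, indent=2)
print(serialized)
```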

Schema version fields are intentionally left at 0.2.2/instance_level_eval_0.2.2 — bumping is a maintainer call, not part of this PR. The Pydantic models are regenerated via the documented datamodel-codegen pipeline (no hand edits to generated files). One new validate_modality_consistency model validator is added through post_codegen.py so the "T2I requires output.media" constraint is enforced at validation time.
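The rule that the new validator enforces can be sketched in plain Python over a dict. This is not the generated Pydantic code: the function name `check_modality_consistency` and the error wording are hypothetical, and only the rule itself (text_to_image requires a non-null output with a non-empty media list, absent modality means text) is taken from this PR.

```python
from typing import Any, Mapping, Optional


def check_modality_consistency(record: Mapping[str, Any]) -> None:
    """Raise ValueError if a text_to_image record lacks output.media."""
    modality = record.get("modality") or "text"  # absent means text
    if modality != "text_to_image":
        return
    output: Optional[Mapping[str, Any]] = record.get("output")
    media = output.get("media") if output else None
    if not media:  # covers null output, missing media, and an empty list
        raise ValueError(
            "modality 'text_to_image' requires a non-empty output.media"
        )


# A legacy record without a modality field is left alone.
check_modality_consistency({"output": {"raw": "some text"}})

# A T2I record with at least one media entry passes.
check_modality_consistency(
    {
        "modality": "text_to_image",
        "output": {"media": [{"media_type": "image", "uri": "file://x.png"}]},
    }
)
```

In the PR the equivalent check runs as a Pydantic `model_validator` on `InstanceLevelEvaluationLog`, so it fires during model validation rather than as a separate call.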

Existing converters and their tests are completely untouched.

What it looks like

A worked example is included at tests/data/t2i/geneval_sdxl_example.{json,jsonl} — minimal GenEval/SDXL fixture showing modality: text_to_image, T2I sampler args in generation_config.additional_details, output.media[] with MediaRefs, and is_correct: null.
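The commit notes say sampler args are stored in generation_config.additional_details as string key-values. A minimal sketch of that convention follows; the helper name `to_string_details` and the concrete values are assumptions for illustration, not taken from the fixture.

```python
from typing import Any, Dict


def to_string_details(sampler_args: Dict[str, Any]) -> Dict[str, str]:
    """Convert every sampler argument to a string value."""
    return {key: str(value) for key, value in sampler_args.items()}


# Made-up sampler settings; the key names follow the list in this PR.
sampler_args = {
    "num_inference_steps": 30,
    "guidance_scale": 7.5,
    "width": 1024,
    "height": 1024,
    "scheduler": "EulerDiscreteScheduler",
    "seed": 42,
}

generation_config = {"additional_details": to_string_details(sampler_args)}
print(generation_config)
```

Keeping every value a string means consumers have to parse numbers back out, which is the trade-off for not adding first-class sampler fields to the schema.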

Test plan

  • All 134 existing tests still pass
  • 5 new T2I tests cover: fixture validates, missing output.media fails with the modality validator's error, is_correct: null passes, unknown modality (image_edit) is rejected, and records without a modality field continue to validate as before
  • python -m every_eval_ever validate tests/data/t2i/ exits 0
  • Both schemas pass jsonschema.Draft7Validator.check_schema
  • Pydantic types regenerated cleanly via the documented uv run datamodel-codegen ... + uv run python post_codegen.py pipeline

felifri and others added 3 commits May 18, 2026 10:15
Three minimal, additive changes let the schema represent text-to-image
generation evaluations (FLUX, SDXL, Imagen, ...) alongside LLMs. Everything
else (sampler args, image dimensions, hashes, rater pools) flows through
the existing `additional_details` escape hatches, which keeps the surface
area small. Version fields are left at 0.2.2 — bumping is a maintainer
decision and not part of this PR.

Aggregate schema (every_eval_ever/schemas/eval.schema.json):
  - New optional `modality` enum (text | text_to_image) on each
    evaluation_results[] entry. Absent = text.

Instance schema (every_eval_ever/schemas/instance_level_eval.schema.json):
  - New optional `modality` enum (same values).
  - New `output.media: MediaRef[]` for generated artifacts. MediaRef is
    intentionally minimal: required {media_type, uri} plus an
    `additional_details` bag for sha256, mime_type, width/height, seed,
    index, etc. (`media_type` is enum [image, video, audio] so the same
    shape extends to future modalities without re-versioning.)
  - `evaluation.is_correct` widened to boolean|null. T2I records set it
    to null when the metric is continuous (FID, CLIPScore, ImageReward).

T2I uses `interaction_type: single_turn` — modality is the orthogonal
axis, which is the key reason this extends cleanly to image-edit / video
later (adding a modality enum value, not touching interaction logic).
Sampler args (num_inference_steps, guidance_scale, width/height, scheduler,
seed) go in `generation_config.additional_details` as string key-values;
human-rater pools go in `metric_config.additional_details` until a future
PR adds first-class structure for them.

Pydantic types regenerated via datamodel-codegen (the documented pipeline).

post_codegen.py:
  - New validate_modality_consistency model_validator on
    InstanceLevelEvaluationLog: modality == text_to_image requires
    non-null output with a non-empty media list.
  - Skip-check fix: the previous blanket "if 'post_codegen.py' in
    content" skip prevented a second validator from being appended to
    the same file. Now scopes the check to the specific validator method
    name so multiple validators can coexist.
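The skip-check fix can be sketched as follows. This is an illustration of the idea only, not the code in post_codegen.py: the helper `append_validator` and the second validator name `validate_score_type` are hypothetical.

```python
def append_validator(content: str, method_name: str, validator_src: str) -> str:
    """Append validator_src unless that specific method is already present."""
    if f"def {method_name}(" in content:
        return content  # this validator was injected on an earlier run
    return content.rstrip("\n") + "\n\n" + validator_src.rstrip("\n") + "\n"


generated = "class InstanceLevelEvaluationLog:\n    pass\n"

score_validator = (
    "    def validate_score_type(self):\n"  # hypothetical name
    "        return self\n"
)
modality_validator = (
    "    def validate_modality_consistency(self):\n"
    "        return self\n"
)

# Both validators land in the same file because each check is keyed on
# its own method name rather than on a file-wide marker.
generated = append_validator(generated, "validate_score_type", score_validator)
generated = append_validator(
    generated, "validate_modality_consistency", modality_validator
)

# Running the step again is a no-op.
assert append_validator(
    generated, "validate_modality_consistency", modality_validator
) == generated
```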

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tests/data/t2i/:
  - geneval_sdxl_example.json: minimal GenEval aggregate for SDXL Base 1.0
    showing modality=text_to_image, a continuous VQA-style score
    (geneval.overall, metric_kind=vqa_score), and T2I sampler args
    (num_inference_steps, guidance_scale, width/height, scheduler, seed)
    living in generation_config.additional_details as string key-values.
  - geneval_sdxl_example_samples.jsonl: one per-sample line showing
    output.media[] with two MediaRefs (additional_details holding sha256,
    width/height, seed, index), empty answer_attribution, is_correct=null.

tests/test_validate.py — new TestT2I class with 5 cases:
  - geneval fixture passes
  - modality=text_to_image without output.media fails
  - modality=text_to_image with null is_correct passes
  - unknown modality value (e.g. image_edit) is rejected
  - existing records without a modality field still validate

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a short subsection (12 lines) after Agentic Evaluations explaining
the three minimal T2I additions (modality, output.media, is_correct
widening), the convention of using additional_details for sampler args
and rater pools, and pointing to the tests/data/t2i/ fixture for a
worked example. No schema-version mentions are altered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Erotemic
Collaborator

FYI: the link is broken.

@Erotemic Erotemic left a comment

I think we need a bit more discussion about what to do here. Not exactly sure what changes should be scoped here. I could also be overcomplicating things, so push back if I am.

Comment thread post_codegen.py
if self.max_score is None:
raise ValueError("score_type 'continuous' requires max_score")
return self
""",

Probably not an issue with this PR, but python code in a string can be a smell. I don't quite understand the purpose of this file yet, but we may want to revisit the design that requires this ATM.

Comment thread tests/test_validate.py
# ===================================================================


T2I_FIXTURE_DIR = Path(__file__).parent / 'data' / 't2i'

Not a blocker, but __file__ is almost never a robust way to handle finding resources. We should think about factoring out an EEE data repo that properly packages and provides access to the data needed for tests.
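One standard-library route to packaged data that avoids `__file__` is `importlib.resources`. The sketch below is only an illustration of that mechanism: the helper `read_packaged_text` and the package name `eee_test_data` are hypothetical, and no such data package exists in this PR. The runnable part reads from the stdlib `json` package so the example works anywhere.

```python
from importlib import resources


def read_packaged_text(package: str, *parts: str) -> str:
    """Read a text resource shipped inside an installed package."""
    node = resources.files(package)  # available since Python 3.9
    for part in parts:
        node = node / part
    return node.read_text(encoding="utf-8")


# Hypothetical usage if fixtures were shipped in a package named
# "eee_test_data" (this package does not exist today):
#   read_packaged_text("eee_test_data", "t2i", "geneval_sdxl_example.json")

# Runnable demonstration against a package that is always installed.
source = read_packaged_text("json", "__init__.py")
print(len(source) > 0)
```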

# --- validators (added by post_codegen.py) ---

- @model_validator(mode='after')
+ @model_validator(mode="after")

We should just use ruff format and settle on a quote style (I like single quotes) to avoid diffs like these. Not a blocker here, but something that should happen soon as the repo grows.

"description": "Reasoning traces of the model if applicable (e.g. chain-of-thought tokens)",
"items": { "type": "string" }
},
"media": {

I think this doesn't correspond with the pydantic validator. I'm not sure if it should. If it did it would get a bit wordy, i.e.

{
  "if": {
    "required": ["modality"],
    "properties": {
      "modality": { "const": "text_to_image" }
    }
  },
  "then": {
    "required": ["output"],
    "properties": {
      "output": {
        "type": "object",
        "required": ["media"],
        "properties": {
          "media": {
            "type": "array",
            "minItems": 1
          }
        }
      }
    }
  }
}

The larger issue is having two sources of truth for the schema. Does it make sense to go all in on pydantic and then have it generate the jsonschema?

},
"uri": {
"type": "string",
"description": "Location of the artifact: 'file://...', 'https://...', 'hf://...', 's3://...', or 'data:...;base64,...' for inline."

I think the data:... option is a BAD idea here. It overscopes what this field is. This should always be a reference to the data, not the data itself. If you want to support embedding the data itself, it should be a separate optional field.

I think a single URI is also a bad idea. There should be a list of suggested ways to access the data; otherwise you run into the dead-URL issue. It happens too often, and it limits the reproducibility value. URLs rot, and they can be changed to point to something else (which is where the hash is important). I would also suggest thinking about allowing distributed content-addressed references like IPFS CIDs or BitTorrent magnet URIs here. These avoid the problem where the content at an address can change, but they do not address the link-rot issue (which is always going to be a fundamental limitation of any reference-based scheme).
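One possible reading of this suggestion, purely as an illustration and not something in the PR: a reference carries several candidate locations plus a content hash, and a consumer accepts the first location whose bytes match the hash. The class `MediaRefSketch`, the function `fetch_verified`, and all URIs below are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MediaRefSketch:
    """Hypothetical shape: several locations plus a content hash."""

    media_type: str
    sha256: str
    uris: List[str] = field(default_factory=list)


def fetch_verified(
    ref: MediaRefSketch, fetch: Callable[[str], Optional[bytes]]
) -> bytes:
    """Try each URI in order; return the first payload whose hash matches."""
    for uri in ref.uris:
        try:
            payload = fetch(uri)
        except OSError:
            continue  # dead link or network failure, try the next location
        if payload is None:
            continue  # nothing stored at this location
        if hashlib.sha256(payload).hexdigest() == ref.sha256:
            return payload  # content is what the record claims it is
    raise LookupError("no URI yielded content matching sha256")


# In-memory stand-in for real storage: one location is missing, one
# serves changed content, and one serves the original bytes.
original = b"fake image bytes"
store = {
    "https://mirror.example/a.png": b"tampered",
    "ipfs://example-cid": original,
}
ref = MediaRefSketch(
    media_type="image",
    sha256=hashlib.sha256(original).hexdigest(),
    uris=["https://dead.example/a.png", *store],
)
recovered = fetch_verified(ref, store.get)
```

The hash guards against content changing behind a URL, while the list of locations gives some resilience against any single link going dead.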

  score: float = Field(..., description='Instance-level score')
- is_correct: bool = Field(
-     ..., description='Whether the final answer is correct'
+ is_correct: bool | None = Field(

Is there a better way to express this that generalizes better? I don't have an immediate idea, but this seems like the start of a boolean explosion to me.

extra='forbid',
)
media_type: MediaType
uri: str = Field(

Similar URI comment

3 participants