feat(vision-metrics): add vision-based VLM judge metrics #640
davidberenstein1957 wants to merge 3 commits into main from
Conversation
Commit 1:
- Add LLM2Vec from OneIG vendor source
- Includes Llama encoder and bidirectional models
- Self-contained, no dependencies on Pruna internals
- Licensed under Apache 2.0

Commit 2:
- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure

Commit 3:
- Add VieScoreMetric for vision-instruction-execution alignment
- Add ImageEditScoreMetric for image editing evaluation
- Add VQAMetric for visual question-answering
- Register all vision metrics in registry
- Add benchmark configs for vision-based evaluation
- Include unit and integration tests with mocked VLM
Cursor Bugbot has reviewed your changes and found 11 potential issues.
Reviewed by Cursor Bugbot for commit cba597b.
```diff
 [project]
 name = "pruna"
-version = "0.3.3"
+version = "0.3.2"
```
Package version downgraded from 0.3.3 to 0.3.2
High Severity
This PR accidentally reverts the package version from 0.3.3 to 0.3.2 and narrows the requires-python range, dropping Python 3.13 support. These regressions are unrelated to the PR's scope, potentially breaking installations for Python 3.13 users and causing package manager issues due to non-monotonic versioning.
Additional Locations (1)
```python
    StatefulVLMMeanScoresMetric,
    TransformersVLM,
    get_vlm,
)
```
RapidataMetric removed from public exports while still shipped
Medium Severity
Removing RapidataMetric from pruna.evaluation.metrics causes an ImportError for existing users. The metric's implementation, tests, and associated rapidata extra in CI/build configurations still exist and reference it, leading to broken tests and CI.
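A low-risk way to avoid the ImportError would be a deprecation shim in `pruna/evaluation/metrics/__init__.py`. The sketch below is only an illustration; the `metric_rapidata` module path is an assumption, not something the diff confirms.

```python
# Hypothetical shim: keeps `from pruna.evaluation.metrics import RapidataMetric`
# working with a warning. The metric_rapidata module path is assumed.
import warnings


def __getattr__(name: str):
    """PEP 562 module-level __getattr__: lazily re-export RapidataMetric."""
    if name == "RapidataMetric":
        warnings.warn(
            "Importing RapidataMetric from pruna.evaluation.metrics is deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        from pruna.evaluation.metrics.metric_rapidata import RapidataMetric

        return RapidataMetric
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```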
| "peft>=0.18.0,<0.19.0", | ||
| "trl<=0.21.0", | ||
| "termcolor==2.3.0", | ||
| "realesrgan", |
realesrgan promoted from extra to base dependency
Medium Severity
realesrgan is moved from the upscale optional extra into the mandatory base dependencies, and the upscale extra is deleted. This forces every install of pruna (including users who never touch upscaling) to pull in realesrgan and its heavy transitive deps, and silently breaks anyone who previously installed pruna[upscale].
```python
        return values[:2]
    if not values:
        return [0.0, 0.0]
    return values + [0.0] * (2 - len(values))
```
Single VIEScore sub-score collapses overall score to zero
Medium Severity
pad_viescore_subscores_to_two pads short sub-score lists with 0.0, and viescore_tie_overall_unit then takes min(SC) * min(PQ). When the VLM legitimately returns one well-formed sub-score (e.g. {"score": [8]}), padding injects a 0.0 and the overall score collapses to 0, indistinguishable from a fully failed evaluation. This silently penalises otherwise-valid VLM responses.
Additional Locations (1)
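One way to address this, taking the helper's name and contract from the hunk above: pad by repeating the single legitimate sub-score (and propagate NaN for an empty list), so `min(SC) * min(PQ)` is not dragged to zero. A minimal sketch:

```python
import math


def pad_viescore_subscores_to_two(values: list[float]) -> list[float]:
    """Pad sub-score lists to length two without injecting an artificial 0.0."""
    if len(values) >= 2:
        return values[:2]
    if not values:
        # No parseable scores at all: propagate NaN instead of a fake zero.
        return [math.nan, math.nan]
    # A single well-formed sub-score, e.g. {"score": [8]}: repeat it rather
    # than padding with 0.0, which would collapse min()-based aggregation.
    return values + [values[-1]]
```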
| """ | ||
| if not self.scores: | ||
| return MetricResult(self.metric_name, self.__dict__, 0.0) | ||
| return MetricResult(self.metric_name, self.__dict__, float(np.mean(self.scores))) |
Empty metric compute returns 0.0 instead of NaN
Low Severity
compute_mean_of_scores returns MetricResult(..., 0.0) when no update calls succeeded. For higher_is_better=True metrics like vqa, vie_score, and img_edit_score, 0.0 is a valid worst score, so a benchmark run where every VLM call failed (no API key, network error, all parse errors) is indistinguishable from a model that scored zero. Returning NaN (or surfacing the empty state) would make this failure mode visible.
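The suggestion in code, as a self-contained sketch: the `MetricResult` dataclass below is a stand-in for Pruna's real class, and the free-function signature differs from the method in the hunk.

```python
import math
from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class MetricResult:  # stand-in for pruna's MetricResult
    name: str
    params: dict[str, Any]
    result: float


def compute_mean_of_scores(name: str, params: dict[str, Any], scores: list[float]) -> MetricResult:
    """Aggregate judge scores; NaN (not 0.0) when no update ever succeeded."""
    if not scores:
        return MetricResult(name, params, math.nan)
    return MetricResult(name, params, float(np.mean(scores)))
```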
```diff
     ),
     metrics=[],  # Paper uses custom evaluation; not in Pruna
-    task_type="text_to_image",
+    task_type=TASK_TYPE_TEXT_IMAGE,
```
task_type rename silently breaks benchmark filters
Medium Severity
Every text-to-image benchmark switches its task_type from "text_to_image" to "text_image". BenchmarkRegistry.list(task_type=...) matches on string equality, so any caller (including users) that filters by the previously documented "text_to_image" string silently gets an empty list with no error. The change also drops the "to", so the new value diverges from the still-used "text_to_video".
Additional Locations (1)
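A backward-compatible option, sketched under the assumption that BenchmarkRegistry compares raw strings: normalise legacy values before matching. The helper name here is hypothetical.

```python
TASK_TYPE_TEXT_IMAGE = "text_image"

# Legacy spellings that callers may still pass to BenchmarkRegistry.list().
_TASK_TYPE_ALIASES = {"text_to_image": TASK_TYPE_TEXT_IMAGE}


def normalize_task_type(task_type: str) -> str:
    """Map old task-type strings to their current canonical value."""
    return _TASK_TYPE_ALIASES.get(task_type, task_type)
```

The registry could call this on both the stored and the queried value, or emit a warning whenever an alias is hit, so old filters keep working instead of returning an empty list.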
| """ | ||
| if name in cls._lazy_metrics and name not in cls._registry: | ||
| importlib.import_module("pruna.evaluation.metrics.metric_oneig_reasoning") | ||
|
|
has_metric lies about lazy oneig_reasoning availability
Medium Severity
has_metric returns True for oneig_reasoning even before the implementing module has been imported, so BenchmarkRegistry._register accepts the OneIG Knowledge Reasoning benchmark unconditionally. If importing metric_oneig_reasoning later fails (heavy LLM2CLIP/PEFT/flash-attn stack, optional deps), get_metric("oneig_reasoning") will surface the failure only at evaluation time, after benchmarks claim to support it.
Additional Locations (1)
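A sketch of a truthful `has_metric`, assuming `_lazy_metrics` maps metric names to module paths (the diff hardcodes the oneig module) and that importing the module registers the metric as a side effect:

```python
import importlib


# Written as a free function taking cls for brevity; in the registry this
# would be a classmethod.
def has_metric(cls, name: str) -> bool:
    """Report availability only if the lazy module actually imports."""
    if name in cls._registry:
        return True
    module_path = cls._lazy_metrics.get(name)
    if module_path is None:
        return False
    try:
        importlib.import_module(module_path)  # importing registers the metric
    except ImportError:
        return False
    return name in cls._registry
```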
```python
_FALLBACK_QUESTION = (
    'On a scale of 0 to 10, how well does this edited image follow the instruction "{prompt}"? '
    "0 = instruction not followed at all, 10 = perfectly executed. Reply with a single number."
)
```
Fallback question format breaks on prompt braces
Low Severity
_FALLBACK_QUESTION.format(prompt=prompt) is called on every fallback path with the user-provided edit instruction. Any { or } in the instruction (e.g. JSON-style examples or templated text) raises KeyError/IndexError from str.format, aborting the whole batch update instead of just producing a single bad score.
Additional Locations (1)
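A sketch of a safer substitution: keep the template literal but bypass `str.format`, so braces in user-provided instructions never reach the format parser. The builder function name is hypothetical.

```python
_FALLBACK_QUESTION = (
    'On a scale of 0 to 10, how well does this edited image follow the instruction "{prompt}"? '
    "0 = instruction not followed at all, 10 = perfectly executed. Reply with a single number."
)


def build_fallback_question(prompt: str) -> str:
    """str.replace, not str.format: '{' or '}' in the instruction stays literal."""
    return _FALLBACK_QUESTION.replace("{prompt}", prompt)
```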
```python
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match:
        return min(float(match.group(0)), 10.0) / 10.0
    return 0.0
```
Negative score regex returns inflated value
Low Severity
get_score_from_response claims to be "always non-negative" and clamps the dict/JSON paths, but the plain-text fallback uses re.search(r"\d+(?:\.\d+)?", text), which strips a leading minus. A response like "-10" matches as 10 and returns 1.0, the maximum score, instead of 0.0.
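The fallback could match an optional sign and clamp both ends; a minimal sketch (the function name is shortened from the report's get_score_from_response):

```python
import re


def score_from_plain_text(text: str) -> float:
    """Parse a 0-10 score from free text; '-10' clamps to 0.0, not 1.0."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if not match:
        return 0.0
    return min(max(float(match.group(0)), 0.0), 10.0) / 10.0
```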
```python
    device=device,
    api_key=api_key,
    call_type=call_type,
)
```
VieScore and ImgEdit silently swallow positional args
Low Severity
Both VieScoreMetric.__init__ and ImageEditScoreMetric.__init__ accept *args and **kwargs but never forward them anywhere. Any positional argument a user passes is silently discarded and any unknown keyword is dropped without warning, so typos like modle_name=... quietly construct a metric using the default litellm configuration.
Additional Locations (1)
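One fix, sketched with parameter names assumed from the hunk above: drop `*args` entirely (so stray positionals raise TypeError automatically) and fail loudly on unknown keywords instead of discarding them.

```python
class VieScoreMetric:
    """Sketch only: parameter names are assumptions based on the hunk above."""

    def __init__(
        self,
        model_name: str | None = None,
        device: str | None = None,
        api_key: str | None = None,
        call_type: str = "litellm",
        **kwargs,
    ) -> None:
        # No *args in the signature: extra positionals now raise TypeError.
        if kwargs:
            raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
        self.model_name = model_name
        self.device = device
        self.api_key = api_key
        self.call_type = call_type
```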


Summary
Adds three vision-based VLM judge metrics that evaluate generated images and other visual outputs against instructions and reference images.
Metrics Added
- `VieScoreMetric` (363 lines): vision-instruction-execution alignment. Does the generated image match the instruction?
- `ImageEditScoreMetric` (219 lines): image editing task quality evaluation.
- `VQAMetric` (158 lines): visual question-answering on generated/edited images.
Files
New:
- `src/pruna/evaluation/metrics/metric_vie_score.py`: VieScore metric
- `src/pruna/evaluation/metrics/metric_img_edit_score.py`: image editing metric
- `src/pruna/evaluation/metrics/metric_vqa.py`: VQA metric
- `tests/evaluation/test_vision_metrics.py` (684 lines): unit + integration tests

Modified:
- `src/pruna/evaluation/metrics/__init__.py`: export the 3 metrics
- `src/pruna/evaluation/metrics/registry.py`: register them in the metric registry
- `src/pruna/evaluation/benchmarks.py`: add vision metric benchmark configs

Testing
Usage
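The snippet below is a hypothetical sketch, not the PR's documented API: the constructor and method names (`call_type`, `model_name`, `update`, `compute`) are inferred from the diff hunks above and may differ from the final implementation.

```python
from PIL import Image

from pruna.evaluation.metrics import VieScoreMetric

# Hypothetical: a litellm-backed judge; model selection and key handling
# are inferred from the diffs, not confirmed.
metric = VieScoreMetric(call_type="litellm", model_name="gpt-4o")

prompts = ["a red cube on a blue table"]
images = [Image.new("RGB", (512, 512))]  # stand-in for generated outputs

metric.update(images, prompts)  # assumed signature
result = metric.compute()       # MetricResult carrying the mean judge score
print(result)
```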
Benchmarks
New benchmark configs:
- `vision-judge-vie`: VieScore instruction alignment
- `vision-judge-edit`: ImageEditScore quality
- `vision-judge-vqa`: VQA correctness

Image Handling
Context
Part of 5-PR segmentation:
Dependencies: PR-2 (infrastructure + VLM base classes)
Blocks: PR-5 (e2e tests depend on all metrics)
Review Focus
🤖 Generated with Claude Code