
feat(vision-metrics): add vision-based VLM judge metrics #640

Closed
davidberenstein1957 wants to merge 3 commits into main from feat/vlm-pr-4-vision-metrics

Conversation

@davidberenstein1957
Member

Summary

Adds three vision-based judge metrics that use a VLM's visual understanding to evaluate generated images and visual outputs against instructions and reference images.

Metrics Added

VieScoreMetric (363 lines)

Vision-instruction-execution alignment. Does the generated image match the instruction?

  • Input: image + instruction text
  • Output: 0-1 alignment score
  • Use case: Image generation task evaluation, instruction following
  • Supports: Vision + instruction reasoning in single VLM call

ImageEditScoreMetric (219 lines)

Image editing task quality evaluation.

  • Input: original image + edited image + edit instruction
  • Output: 0-1 edit quality score
  • Use case: Image inpainting, style transfer, object manipulation
  • Evaluates: Instruction adherence + edit believability

VQAMetric (158 lines)

Visual question-answering on generated/edited images.

  • Input: image + question
  • Output: 0-1 correctness score
  • Use case: Visual reasoning evaluation
  • Fallback: AlignmentScore if VQA unavailable

Files

New:

  • src/pruna/evaluation/metrics/metric_vie_score.py — VieScore metric
  • src/pruna/evaluation/metrics/metric_img_edit_score.py — Image editing metric
  • src/pruna/evaluation/metrics/metric_vqa.py — VQA metric
  • tests/evaluation/test_vision_metrics.py (684 lines) — Unit + integration tests

Modified:

  • src/pruna/evaluation/metrics/__init__.py — Export 3 metrics
  • src/pruna/evaluation/metrics/registry.py — Register in metric registry
  • src/pruna/evaluation/benchmarks.py — Add vision metric benchmark configs

Testing

# Unit tests (mocked VLM)
pytest tests/evaluation/test_vision_metrics.py -v

# Specific metric
pytest tests/evaluation/test_vision_metrics.py::test_vie_score_metric -v

# With real VLM (requires OPENAI_API_KEY)
pytest tests/evaluation/test_vision_metrics.py::test_vie_score_real -v
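
For reviewers who want to poke at the mocking seam, here is a minimal sketch of a unit test that stubs the judge call so no API request is made. The patch target (_call_vlm) and the canned response format are assumptions for illustration, not the exact fixtures used in test_vision_metrics.py:

from unittest.mock import patch

from PIL import Image

from pruna.evaluation.metrics import VieScoreMetric

def test_vie_score_mocked_sketch():
    image = Image.new("RGB", (64, 64))  # tiny synthetic image, no file I/O
    # Hypothetical seam: the real tests may stub a different method or the VLM client itself.
    with patch.object(VieScoreMetric, "_call_vlm", return_value='{"score": [8, 9]}'):
        metric = VieScoreMetric(vlm_type="litellm", model_name="openai/gpt-4o")
        score = metric(image=image, instruction="A dog wearing sunglasses")
        assert 0.0 <= score <= 1.0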

Usage

from pruna.evaluation.metrics import VieScoreMetric

metric = VieScoreMetric(
    vlm_type="litellm",
    model_name="openai/gpt-4o"
)

score = metric(
    image=generated_image,
    instruction="A dog wearing sunglasses"
)
print(score)  # 0.87 — strong alignment with instruction
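
ImageEditScoreMetric and VQAMetric follow the same judge pattern; the call signatures below are inferred from the input/output descriptions above and should be checked against the docstrings:

from pruna.evaluation.metrics import ImageEditScoreMetric, VQAMetric

edit_metric = ImageEditScoreMetric(vlm_type="litellm", model_name="openai/gpt-4o")
edit_score = edit_metric(
    original_image=source_image,  # parameter names assumed from the inputs listed above
    edited_image=edited_image,
    instruction="Replace the sky with a sunset",
)

vqa_metric = VQAMetric(vlm_type="litellm", model_name="openai/gpt-4o")
vqa_score = vqa_metric(
    image=generated_image,
    question="Is the dog wearing sunglasses?",
)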

Benchmarks

New benchmark configs:

  • vision-judge-vie — VieScore instruction alignment
  • vision-judge-edit — ImageEditScore quality
  • vision-judge-vqa — VQA correctness
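
Assuming the registry API referenced in the review below, the new configs are discoverable programmatically (the import path is assumed from the file layout above):

from pruna.evaluation.benchmarks import BenchmarkRegistry

# List registered benchmarks for a task type; note that this PR changes the
# task_type strings, see the Bugbot thread below.
for config in BenchmarkRegistry.list(task_type="text_to_image"):
    print(config)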

Image Handling

  • Accepts: torch tensors, PIL Images, numpy arrays
  • Automatic device placement (cpu/cuda/mps)
  • Batch support (metrics aggregate across batches)
  • EXIF stripping (privacy, consistency)
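
Roughly, the conversion pipeline these bullets describe looks like the sketch below; the helper name and exact steps are assumptions, not the shipped implementation:

import numpy as np
import torch
from PIL import Image

def to_pil(image):  # hypothetical helper
    """Normalize tensor/array/PIL inputs to an RGB PIL image; re-encoding drops EXIF."""
    if isinstance(image, torch.Tensor):
        # CHW float tensor in [0, 1] -> HWC uint8 on CPU
        image = (image.detach().cpu().clamp(0, 1) * 255).byte().permute(1, 2, 0).numpy()
    if isinstance(image, np.ndarray):
        image = Image.fromarray(image.astype(np.uint8))
    return image.convert("RGB")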

Context

Part of a 5-PR stack:

  1. PR-1: Vendor Code ✓
  2. PR-2: Infrastructure ✓
  3. PR-3: Text Metrics ✓
  4. [THIS] PR-4: Vision Metrics — Second metric family
  5. PR-5: E2E Tests — Integration + docs

Dependencies: PR-2 (infrastructure + VLM base classes)
Blocks: PR-5 (e2e tests depend on all metrics)

Review Focus

  • Image preprocessing (normalization, resizing, privacy)
  • Prompt engineering for visual tasks (effectiveness, injection safety)
  • Score calibration (is 0.5 truly neutral?)
  • Device handling (image tensor placement, GPU memory)
  • Test coverage (various image formats, dimensions, edge cases)
  • VQA fallback logic (when does it trigger? safely?)
  • Docstring accuracy (do docstrings match behavior?)

🤖 Generated with Claude Code

- Add LLM2Vec from OneIG vendor source
- Includes Llama encoder and bidirectional models
- Self-contained, no dependencies on Pruna internals
- Licensed under Apache 2.0
- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure
- Add VieScoreMetric for vision-instruction-execution alignment
- Add ImageEditScoreMetric for image editing evaluation
- Add VQAMetric for visual question-answering
- Register all vision metrics in registry
- Add benchmark configs for vision-based evaluation
- Include unit and integration tests with mocked VLM

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 11 potential issues.


Reviewed by Cursor Bugbot for commit cba597b.

Comment thread: pyproject.toml

  [project]
  name = "pruna"
- version = "0.3.3"
+ version = "0.3.2"

Package version downgraded from 0.3.3 to 0.3.2

High Severity

This PR accidentally reverts the package version from 0.3.3 to 0.3.2 and narrows the requires-python range, dropping Python 3.13 support. These regressions are unrelated to the PR's scope, potentially breaking installations for Python 3.13 users and causing package manager issues due to non-monotonic versioning.



Comment thread: src/pruna/evaluation/metrics/__init__.py

    StatefulVLMMeanScoresMetric,
    TransformersVLM,
    get_vlm,
)

RapidataMetric removed from public exports while still shipped

Medium Severity

Removing RapidataMetric from pruna.evaluation.metrics causes an ImportError for existing users. The metric's implementation, tests, and associated rapidata extra in CI/build configurations still exist and reference it, leading to broken tests and CI.



Comment thread: pyproject.toml

  "peft>=0.18.0,<0.19.0",
  "trl<=0.21.0",
  "termcolor==2.3.0",
+ "realesrgan",

realesrgan promoted from extra to base dependency

Medium Severity

realesrgan is moved from the upscale optional extra into the mandatory base dependencies, and the upscale extra is deleted. This forces every install of pruna (including users who never touch upscaling) to pull in realesrgan and its heavy transitive deps, and silently breaks anyone who previously installed pruna[upscale].



        return values[:2]
    if not values:
        return [0.0, 0.0]
    return values + [0.0] * (2 - len(values))

Single VIEScore sub-score collapses overall score to zero

Medium Severity

pad_viescore_subscores_to_two pads short sub-score lists with 0.0, and viescore_tie_overall_unit then takes min(SC) * min(PQ). When the VLM legitimately returns one well-formed sub-score (e.g. {"score": [8]}), padding injects a 0.0 and the overall score collapses to 0, indistinguishable from a fully failed evaluation. This silently penalises otherwise-valid VLM responses.
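
Concretely, with the padding logic inlined from the snippet above (the leading length guard is reconstructed from context, values are illustrative):

def pad_to_two(values):
    if len(values) >= 2:
        return values[:2]
    if not values:
        return [0.0, 0.0]
    return values + [0.0] * (2 - len(values))

sc = pad_to_two([8.0])       # one valid sub-score -> [8.0, 0.0]
pq = pad_to_two([9.0, 9.0])
overall = min(sc) * min(pq)  # 0.0 * 9.0 == 0.0: a valid response scored as total failure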



"""
if not self.scores:
return MetricResult(self.metric_name, self.__dict__, 0.0)
return MetricResult(self.metric_name, self.__dict__, float(np.mean(self.scores)))

Empty metric compute returns 0.0 instead of NaN

Low Severity

compute_mean_of_scores returns MetricResult(..., 0.0) when no update calls succeeded. For higher_is_better=True metrics like vqa, vie_score, and img_edit_score, 0.0 is a valid worst score, so a benchmark run where every VLM call failed (no API key, network error, all parse errors) is indistinguishable from a model that scored zero. Returning NaN (or surfacing the empty state) would make this failure mode visible.
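
A minimal sketch of the suggested fix, reusing the names from the snippet above:

import numpy as np

def compute_mean_of_scores(self):
    if not self.scores:
        # Surface the "no successful updates" state instead of a legitimate-looking 0.0.
        return MetricResult(self.metric_name, self.__dict__, float("nan"))
    return MetricResult(self.metric_name, self.__dict__, float(np.mean(self.scores)))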



    ),
    metrics=[],  # Paper uses custom evaluation; not in Pruna
-   task_type="text_to_image",
+   task_type=TASK_TYPE_TEXT_IMAGE,

task_type rename silently breaks benchmark filters

Medium Severity

Every text-to-image benchmark switches its task_type from "text_to_image" to "text_image". BenchmarkRegistry.list(task_type=...) matches on string equality, so any caller (including users) that filters by the previously documented "text_to_image" string will silently get an empty list with no error. The change also drops the connecting "to", so the new value diverges from the still-used "text_to_video".
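
For example, under the rename a previously working filter now comes back empty:

# Before this PR this returned the text-to-image benchmarks; after it, [] with no error.
BenchmarkRegistry.list(task_type="text_to_image")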



"""
if name in cls._lazy_metrics and name not in cls._registry:
importlib.import_module("pruna.evaluation.metrics.metric_oneig_reasoning")


has_metric lies about lazy oneig_reasoning availability

Medium Severity

has_metric returns True for oneig_reasoning even before the implementing module has been imported, so BenchmarkRegistry._register accepts the OneIG Knowledge Reasoning benchmark unconditionally. If importing metric_oneig_reasoning later fails (heavy LLM2CLIP/PEFT/flash-attn stack, optional deps), get_metric("oneig_reasoning") will surface the failure only at evaluation time, after benchmarks claim to support it.
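
One possible shape for a stricter check, assuming _lazy_metrics maps metric names to module paths (an assumption; the snippet above hardcodes a single module):

import importlib

@classmethod
def has_metric(cls, name):
    if name in cls._lazy_metrics and name not in cls._registry:
        try:
            importlib.import_module(cls._lazy_metrics[name])  # assumed name-to-module mapping
        except ImportError:
            return False  # heavy optional deps missing: report unavailability honestly
    return name in cls._registry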



_FALLBACK_QUESTION = (
    'On a scale of 0 to 10, how well does this edited image follow the instruction "{prompt}"? '
    "0 = instruction not followed at all, 10 = perfectly executed. Reply with a single number."
)

Fallback question format breaks on prompt braces

Low Severity

_FALLBACK_QUESTION.format(prompt=prompt) is called on every fallback path with the user-provided edit instruction. Any { or } in the instruction (e.g. JSON-style examples or templated text) raises KeyError/IndexError from str.format, aborting the whole batch update instead of just producing a single bad score.
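
A brace-safe alternative is to substitute the placeholder literally instead of going through str.format (sketch, reusing _FALLBACK_QUESTION from the snippet above):

# Literal replacement: braces in the user instruction can no longer break formatting.
question = _FALLBACK_QUESTION.replace("{prompt}", prompt)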



    match = re.search(r"\d+(?:\.\d+)?", text)
    if match:
        return min(float(match.group(0)), 10.0) / 10.0
    return 0.0

Negative score regex returns inflated value

Low Severity

get_score_from_response claims to be "always non-negative" and clamps the dict/JSON paths, but the plain-text fallback uses re.search(r"\d+(?:\.\d+)?", text), which strips a leading minus. A response like "-10" matches as 10 and returns 1.0, the maximum score, instead of 0.0.
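
A sketch of the fix: let the regex see the sign, then clamp into [0, 10]:

import re

def get_score_sketch(text: str) -> float:
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match:
        # "-10" now parses as -10.0 and clamps to 0.0 instead of inflating to 1.0.
        return min(max(float(match.group(0)), 0.0), 10.0) / 10.0
    return 0.0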



    device=device,
    api_key=api_key,
    call_type=call_type,
)

VieScore and ImgEdit silently swallow positional args

Low Severity

Both VieScoreMetric.__init__ and ImageEditScoreMetric.__init__ accept *args and **kwargs but never forward them anywhere. Any positional argument a user passes is silently discarded and any unknown keyword is dropped without warning, so typos like modle_name=... quietly construct a metric using the default litellm configuration.
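
A sketch of the stricter constructor, using the parameters visible in the snippet above; dropping *args/**kwargs makes typos like modle_name=... raise TypeError instead of silently falling back to defaults:

def __init__(
    self,
    vlm_type="litellm",          # default per the review note above
    model_name="openai/gpt-4o",  # default value assumed for illustration
    device=None,
    api_key=None,
    call_type=None,
):
    ...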



@davidberenstein1957
Member Author

Superseded by metric-focused stacked PRs: #645, #646, #647, #648, #649, #650, #651, and stacked e2e #641.

