feat(vision-metrics): add vision-based VLM judge metrics #640
davidberenstein1957 wants to merge 3 commits into main from
Conversation
Commit 1:
- Add LLM2Vec from OneIG vendor source
- Includes Llama encoder and bidirectional models
- Self-contained, no dependencies on Pruna internals
- Licensed under Apache 2.0

Commit 2:
- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure

Commit 3:
- Add VieScoreMetric for vision-instruction-execution alignment
- Add ImageEditScoreMetric for image editing evaluation
- Add VQAMetric for visual question-answering
- Register all vision metrics in registry
- Add benchmark configs for vision-based evaluation
- Include unit and integration tests with mocked VLM
Cursor Bugbot has reviewed your changes and found 11 potential issues.
Reviewed by Cursor Bugbot for commit cba597b.
```diff
 [project]
 name = "pruna"
-version = "0.3.3"
+version = "0.3.2"
```
Package version downgraded from 0.3.3 to 0.3.2
High Severity
This PR accidentally reverts the package version from 0.3.3 to 0.3.2 and narrows the requires-python range, dropping Python 3.13 support. These regressions are unrelated to the PR's scope, potentially breaking installations for Python 3.13 users and causing package manager issues due to non-monotonic versioning.
Additional Locations (1)
```python
    StatefulVLMMeanScoresMetric,
    TransformersVLM,
    get_vlm,
)
```
RapidataMetric removed from public exports while still shipped
Medium Severity
Removing RapidataMetric from pruna.evaluation.metrics causes an ImportError for existing users. The metric's implementation, tests, and associated rapidata extra in CI/build configurations still exist and reference it, leading to broken tests and CI.
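A low-risk way to avoid the ImportError would be a deprecation shim in `pruna/evaluation/metrics/__init__.py`. The sketch below is only an illustration; the `metric_rapidata` module path is an assumption, not something the diff confirms.

```python
# Hypothetical shim: keeps `from pruna.evaluation.metrics import RapidataMetric`
# working with a warning. The metric_rapidata module path is assumed.
import warnings


def __getattr__(name: str):
    """PEP 562 module-level __getattr__: lazily re-export RapidataMetric."""
    if name == "RapidataMetric":
        warnings.warn(
            "Importing RapidataMetric from pruna.evaluation.metrics is deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        from pruna.evaluation.metrics.metric_rapidata import RapidataMetric

        return RapidataMetric
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```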
| "peft>=0.18.0,<0.19.0", | ||
| "trl<=0.21.0", | ||
| "termcolor==2.3.0", | ||
| "realesrgan", |
realesrgan promoted from extra to base dependency
Medium Severity
realesrgan is moved from the upscale optional extra into the mandatory base dependencies, and the upscale extra is deleted. This forces every install of pruna (including users who never touch upscaling) to pull in realesrgan and its heavy transitive deps, and silently breaks anyone who previously installed pruna[upscale].
```python
        return values[:2]
    if not values:
        return [0.0, 0.0]
    return values + [0.0] * (2 - len(values))
```
Single VIEScore sub-score collapses overall score to zero
Medium Severity
pad_viescore_subscores_to_two pads short sub-score lists with 0.0, and viescore_tie_overall_unit then takes min(SC) * min(PQ). When the VLM legitimately returns one well-formed sub-score (e.g. {"score": [8]}), padding injects a 0.0 and the overall score collapses to 0, indistinguishable from a fully failed evaluation. This silently penalises otherwise-valid VLM responses.
Additional Locations (1)
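One way to address this, taking the helper's name and contract from the hunk above: pad by repeating the single legitimate sub-score (and propagate NaN for an empty list), so `min(SC) * min(PQ)` is not dragged to zero. A minimal sketch:

```python
import math


def pad_viescore_subscores_to_two(values: list[float]) -> list[float]:
    """Pad sub-score lists to length two without injecting an artificial 0.0."""
    if len(values) >= 2:
        return values[:2]
    if not values:
        # No parseable scores at all: propagate NaN instead of a fake zero.
        return [math.nan, math.nan]
    # A single well-formed sub-score, e.g. {"score": [8]}: repeat it rather
    # than padding with 0.0, which would collapse min()-based aggregation.
    return values + [values[-1]]
```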
| """ | ||
| if not self.scores: | ||
| return MetricResult(self.metric_name, self.__dict__, 0.0) | ||
| return MetricResult(self.metric_name, self.__dict__, float(np.mean(self.scores))) |
Empty metric compute returns 0.0 instead of NaN
Low Severity
compute_mean_of_scores returns MetricResult(..., 0.0) when no update calls succeeded. For higher_is_better=True metrics like vqa, vie_score, and img_edit_score, 0.0 is a valid worst score, so a benchmark run where every VLM call failed (no API key, network error, all parse errors) is indistinguishable from a model that scored zero. Returning NaN (or surfacing the empty state) would make this failure mode visible.
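The suggestion in code, as a self-contained sketch: the `MetricResult` dataclass below is a stand-in for Pruna's real class, and the free-function signature differs from the method in the hunk.

```python
import math
from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class MetricResult:  # stand-in for pruna's MetricResult
    name: str
    params: dict[str, Any]
    result: float


def compute_mean_of_scores(name: str, params: dict[str, Any], scores: list[float]) -> MetricResult:
    """Aggregate judge scores; NaN (not 0.0) when no update ever succeeded."""
    if not scores:
        return MetricResult(name, params, math.nan)
    return MetricResult(name, params, float(np.mean(scores)))
```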
```diff
     ),
     metrics=[],  # Paper uses custom evaluation; not in Pruna
-    task_type="text_to_image",
+    task_type=TASK_TYPE_TEXT_IMAGE,
```
task_type rename silently breaks benchmark filters
Medium Severity
Every text-to-image benchmark switches its task_type from "text_to_image" to "text_image". BenchmarkRegistry.list(task_type=...) matches on string equality, so any caller (including users) that filters by the previously documented "text_to_image" string silently gets an empty list with no error. The change also drops the "to", so the new value diverges from the still-used "text_to_video".
Additional Locations (1)
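A backward-compatible option, sketched under the assumption that BenchmarkRegistry compares raw strings: normalise legacy values before matching. The helper name here is hypothetical.

```python
TASK_TYPE_TEXT_IMAGE = "text_image"

# Legacy spellings that callers may still pass to BenchmarkRegistry.list().
_TASK_TYPE_ALIASES = {"text_to_image": TASK_TYPE_TEXT_IMAGE}


def normalize_task_type(task_type: str) -> str:
    """Map old task-type strings to their current canonical value."""
    return _TASK_TYPE_ALIASES.get(task_type, task_type)
```

The registry could call this on both the stored and the queried value, or emit a warning whenever an alias is hit, so old filters keep working instead of returning an empty list.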
| """ | ||
| if name in cls._lazy_metrics and name not in cls._registry: | ||
| importlib.import_module("pruna.evaluation.metrics.metric_oneig_reasoning") | ||
|
|
has_metric lies about lazy oneig_reasoning availability
Medium Severity
has_metric returns True for oneig_reasoning even before the implementing module has been imported, so BenchmarkRegistry._register accepts the OneIG Knowledge Reasoning benchmark unconditionally. If importing metric_oneig_reasoning later fails (heavy LLM2CLIP/PEFT/flash-attn stack, optional deps), get_metric("oneig_reasoning") will surface the failure only at evaluation time, after benchmarks claim to support it.
Additional Locations (1)
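A sketch of a truthful `has_metric`, assuming `_lazy_metrics` maps metric names to module paths (the diff hardcodes the oneig module) and that importing the module registers the metric as a side effect:

```python
import importlib


# Written as a free function taking cls for brevity; in the registry this
# would be a classmethod.
def has_metric(cls, name: str) -> bool:
    """Report availability only if the lazy module actually imports."""
    if name in cls._registry:
        return True
    module_path = cls._lazy_metrics.get(name)
    if module_path is None:
        return False
    try:
        importlib.import_module(module_path)  # importing registers the metric
    except ImportError:
        return False
    return name in cls._registry
```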
```python
_FALLBACK_QUESTION = (
    'On a scale of 0 to 10, how well does this edited image follow the instruction "{prompt}"? '
    "0 = instruction not followed at all, 10 = perfectly executed. Reply with a single number."
)
```
Fallback question format breaks on prompt braces
Low Severity
_FALLBACK_QUESTION.format(prompt=prompt) is called on every fallback path with the user-provided edit instruction. Any { or } in the instruction (e.g. JSON-style examples or templated text) raises KeyError/IndexError from str.format, aborting the whole batch update instead of just producing a single bad score.
Additional Locations (1)
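A sketch of a safer substitution: keep the template literal but bypass `str.format`, so braces in user-provided instructions never reach the format parser. The builder function name is hypothetical.

```python
_FALLBACK_QUESTION = (
    'On a scale of 0 to 10, how well does this edited image follow the instruction "{prompt}"? '
    "0 = instruction not followed at all, 10 = perfectly executed. Reply with a single number."
)


def build_fallback_question(prompt: str) -> str:
    """str.replace, not str.format: '{' or '}' in the instruction stays literal."""
    return _FALLBACK_QUESTION.replace("{prompt}", prompt)
```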
```python
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match:
        return min(float(match.group(0)), 10.0) / 10.0
    return 0.0
```
Negative score regex returns inflated value
Low Severity
get_score_from_response claims to be "always non-negative" and clamps the dict/JSON paths, but the plain-text fallback uses re.search(r"\d+(?:\.\d+)?", text), which strips a leading minus. A response like "-10" matches as 10 and returns 1.0, the maximum score, instead of 0.0.
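The fallback could match an optional sign and clamp both ends; a minimal sketch (the function name is shortened from the report's get_score_from_response):

```python
import re


def score_from_plain_text(text: str) -> float:
    """Parse a 0-10 score from free text; '-10' clamps to 0.0, not 1.0."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if not match:
        return 0.0
    return min(max(float(match.group(0)), 0.0), 10.0) / 10.0
```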
```python
    device=device,
    api_key=api_key,
    call_type=call_type,
)
```
VieScore and ImgEdit silently swallow positional args
Low Severity
Both VieScoreMetric.__init__ and ImageEditScoreMetric.__init__ accept *args and **kwargs but never forward them anywhere. Any positional argument a user passes is silently discarded and any unknown keyword is dropped without warning, so typos like modle_name=... quietly construct a metric using the default litellm configuration.
Additional Locations (1)
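One fix, sketched with parameter names assumed from the hunk above: drop `*args` entirely (so stray positionals raise TypeError automatically) and fail loudly on unknown keywords instead of discarding them.

```python
class VieScoreMetric:
    """Sketch only: parameter names are assumptions based on the hunk above."""

    def __init__(
        self,
        model_name: str | None = None,
        device: str | None = None,
        api_key: str | None = None,
        call_type: str = "litellm",
        **kwargs,
    ) -> None:
        # No *args in the signature: extra positionals now raise TypeError.
        if kwargs:
            raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
        self.model_name = model_name
        self.device = device
        self.api_key = api_key
        self.call_type = call_type
```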


Summary
Adds three vision-based VLM judge metrics that evaluate generated images and other visual outputs against instructions and reference images.
Metrics Added
- `VieScoreMetric` (363 lines): vision-instruction-execution alignment. Does the generated image match the instruction?
- `ImageEditScoreMetric` (219 lines): image editing task quality evaluation.
- `VQAMetric` (158 lines): visual question-answering on generated/edited images.
Files
New:
- `src/pruna/evaluation/metrics/metric_vie_score.py`: VieScore metric
- `src/pruna/evaluation/metrics/metric_img_edit_score.py`: image editing metric
- `src/pruna/evaluation/metrics/metric_vqa.py`: VQA metric
- `tests/evaluation/test_vision_metrics.py` (684 lines): unit + integration tests

Modified:
- `src/pruna/evaluation/metrics/__init__.py`: export the 3 metrics
- `src/pruna/evaluation/metrics/registry.py`: register them in the metric registry
- `src/pruna/evaluation/benchmarks.py`: add vision metric benchmark configs

Testing
Usage
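The snippet below is a hypothetical sketch, not the PR's documented API: the constructor and method names (`call_type`, `model_name`, `update`, `compute`) are inferred from the diff hunks above and may differ from the final implementation.

```python
from PIL import Image

from pruna.evaluation.metrics import VieScoreMetric

# Hypothetical: a litellm-backed judge; model selection and key handling
# are inferred from the diffs, not confirmed.
metric = VieScoreMetric(call_type="litellm", model_name="gpt-4o")

prompts = ["a red cube on a blue table"]
images = [Image.new("RGB", (512, 512))]  # stand-in for generated outputs

metric.update(images, prompts)  # assumed signature
result = metric.compute()       # MetricResult carrying the mean judge score
print(result)
```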
Benchmarks
New benchmark configs:
- `vision-judge-vie`: VieScore instruction alignment
- `vision-judge-edit`: ImageEditScore quality
- `vision-judge-vqa`: VQA correctness

Image Handling
Context
Part of 5-PR segmentation:
Dependencies: PR-2 (infrastructure + VLM base classes)
Blocks: PR-5 (e2e tests depend on all metrics)
Review Focus
🤖 Generated with Claude Code