feat(text-metrics): split qa_accuracy #645
davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-2-infrastructure from
Conversation
Isolates the `qa_accuracy` metric implementation and the GenEval benchmark wiring so they can be reviewed independently before stacking the remaining text metrics.

Made with Cursor
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment `@cursor review` or `bugbot run` to trigger another review on this PR.
Reviewed by Cursor Bugbot for commit 15db155. Configure here.
```diff
         "between prompts and generated images via VQA-style questions."
     ),
-    metrics=["clip_score"],  # §3.2: Mask2Former; not in Pruna
+    metrics=["qa_accuracy", "clip_score"],  # strict QA + CLIP score
```
GenEval benchmark uses lenient mean aggregation
Medium Severity
The GenEval benchmark's `qa_accuracy` metric is intended for strict QA (all-or-nothing aggregation). However, `Task.from_benchmark` does not pass the `all_or_nothing` kwarg, so `QAAccuracyMetric` falls back to its default mean aggregation. This inflates scores and makes them incomparable to the paper's reference numbers.
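A minimal sketch of why the two aggregation modes diverge. The names (`aggregate`, the `results` layout) are hypothetical and only illustrate the mean-vs-strict distinction the comment describes; they are not the actual `QAAccuracyMetric` API.

```python
# Hypothetical per-image QA results: each inner list holds pass/fail
# outcomes for that image's VQA-style questions.
results = [[True, True, False], [True, True, True]]

def aggregate(per_image, all_or_nothing=False):
    if all_or_nothing:
        # Strict QA: an image scores 1.0 only if every question passes.
        scores = [1.0 if all(qs) else 0.0 for qs in per_image]
    else:
        # Lenient default: fraction of questions passed per image.
        scores = [sum(qs) / len(qs) for qs in per_image]
    return sum(scores) / len(scores)

print(aggregate(results))                       # lenient mean: ~0.833
print(aggregate(results, all_or_nothing=True))  # strict: 0.5
```

The first image fails one of three questions, so lenient aggregation still credits it 2/3 while strict aggregation scores it 0, which is why the defaulted kwarg silently inflates the benchmark.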
```python
    raise ValueError(
        f"qa_accuracy aggregation must be one of {{'mean', 'all_or_nothing'}}. Got: {self.aggregation!r}."
    )
self.metric_units = type(self).metric_units
```
Redundant metric_units self-assignment in __init__
Low Severity
Setting `self.metric_units = type(self).metric_units` is a no-op: `metric_units` is already declared as a class attribute, and nothing earlier in `__init__` (neither `super().__init__` nor the surrounding code) overrides it. The line merely shadows the class attribute with the same value and adds maintenance noise.
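A small sketch of the shadowing behavior being flagged. The class name and attribute value here are hypothetical stand-ins for the real metric class:

```python
class QAAccuracyMetricSketch:
    metric_units = "accuracy"  # class attribute, shared by all instances

    def __init__(self):
        # No-op flagged by the review: reads the class attribute and
        # re-binds the identical value onto the instance dict.
        self.metric_units = type(self).metric_units

m = QAAccuracyMetricSketch()
# The instance now carries its own copy, but the observable value is
# unchanged, so dropping the line would not alter behavior.
print(m.metric_units)  # accuracy
print("metric_units" in vars(m))  # True
```

The only effect is an extra entry in the instance `__dict__`; attribute lookup would have found the class attribute anyway.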


Summary
- Split `qa_accuracy` into its own stacked branch/PR
- Add the `QAAccuracyMetric` implementation
- Wire the GenEval benchmark to `qa_accuracy` + `clip_score`

Test plan
- `uv run pytest tests/evaluation/test_text_metrics.py -k qa_accuracy`

Made with Cursor