feat(text-metrics): split qa_accuracy #645
davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-2-infrastructure from
Conversation
Isolates the `qa_accuracy` metric implementation and the GenEval benchmark wiring so they can be reviewed independently before stacking the remaining text metrics.

Made with Cursor
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment `@cursor review` or `bugbot run` to trigger another review on this PR.
Reviewed by Cursor Bugbot for commit 15db155. Configure here.
```diff
         "between prompts and generated images via VQA-style questions."
     ),
-    metrics=["clip_score"],  # §3.2: Mask2Former; not in Pruna
+    metrics=["qa_accuracy", "clip_score"],  # strict QA + CLIP score
```
GenEval benchmark uses lenient mean aggregation
Medium Severity
The GenEval benchmark's `qa_accuracy` metric is intended for strict QA (all-or-nothing aggregation). However, `Task.from_benchmark` does not pass the `all_or_nothing` kwarg, so `QAAccuracyMetric` falls back to its default mean aggregation. This inflates scores and makes them incomparable to the paper's reference numbers.
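A minimal sketch of why the two aggregation modes diverge. The names (`aggregate`, the `results` layout) are hypothetical and only illustrate the mean-vs-strict distinction the comment describes; they are not the actual `QAAccuracyMetric` API.

```python
# Hypothetical per-image QA results: each inner list holds pass/fail
# outcomes for that image's VQA-style questions.
results = [[True, True, False], [True, True, True]]

def aggregate(per_image, all_or_nothing=False):
    if all_or_nothing:
        # Strict QA: an image scores 1.0 only if every question passes.
        scores = [1.0 if all(qs) else 0.0 for qs in per_image]
    else:
        # Lenient default: fraction of questions passed per image.
        scores = [sum(qs) / len(qs) for qs in per_image]
    return sum(scores) / len(scores)

print(aggregate(results))                       # lenient mean: ~0.833
print(aggregate(results, all_or_nothing=True))  # strict: 0.5
```

The first image fails one of three questions, so lenient aggregation still credits it 2/3 while strict aggregation scores it 0, which is why the defaulted kwarg silently inflates the benchmark.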
```python
    raise ValueError(
        f"qa_accuracy aggregation must be one of {{'mean', 'all_or_nothing'}}. Got: {self.aggregation!r}."
    )
self.metric_units = type(self).metric_units
```
Redundant metric_units self-assignment in __init__
Low Severity
Setting `self.metric_units = type(self).metric_units` is a no-op: `metric_units` is already declared as a class attribute, and nothing earlier in `__init__` (neither `super().__init__` nor the surrounding code) overrides it. The line merely shadows the class attribute with the same value and adds maintenance noise.
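A small sketch of the shadowing behavior being flagged. The class name and attribute value here are hypothetical stand-ins for the real metric class:

```python
class QAAccuracyMetricSketch:
    metric_units = "accuracy"  # class attribute, shared by all instances

    def __init__(self):
        # No-op flagged by the review: reads the class attribute and
        # re-binds the identical value onto the instance dict.
        self.metric_units = type(self).metric_units

m = QAAccuracyMetricSketch()
# The instance now carries its own copy, but the observable value is
# unchanged, so dropping the line would not alter behavior.
print(m.metric_units)  # accuracy
print("metric_units" in vars(m))  # True
```

The only effect is an extra entry in the instance `__dict__`; attribute lookup would have found the class attribute anyway.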


Summary
- Split `qa_accuracy` into its own stacked branch/PR
- Add the `QAAccuracyMetric` implementation
- Wire the GenEval benchmark to `qa_accuracy` + `clip_score`

Test plan
- `uv run pytest tests/evaluation/test_text_metrics.py -k qa_accuracy`

Made with Cursor