feat(text-metrics): split oneig_reasoning #648
davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-3c-text-score-pair from
Conversation
Adds the oneig_reasoning metric, lazy registry loading, and focused benchmark/test wiring as the final text-metric stack branch. Made with Cursor
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit e586366.
    model_name=self.model_name,
    llm_model_name=self.llm_model_name,
    device=self.device,
)
Scorer not cached, models reload every batch
High Severity
_get_scorer creates a brand new _LLM2CLIPScorer instance on every update() call when no mock scorer is injected (the production path). Because _LLM2CLIPScorer._load_models lazily loads models by checking self._clip_model is not None, and each new instance starts with self._clip_model = None, the multi-gigabyte LLM2CLIP checkpoints are re-loaded from disk on every batch. The method needs to cache the newly created scorer in self._scorer before returning it.
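A minimal sketch of the suggested fix, using illustrative stand-in names (`ExpensiveScorer`, `Metric`) rather than the actual PR classes: cache the lazily created scorer on the instance so the heavy model load happens at most once.

```python
class ExpensiveScorer:
    """Stand-in for _LLM2CLIPScorer; pretend __init__ triggers a multi-GB model load."""
    load_count = 0

    def __init__(self):
        ExpensiveScorer.load_count += 1  # counts how many times models were "loaded"


class Metric:
    """Stand-in for the metric class; the cache slot starts empty."""

    def __init__(self):
        self._scorer = None

    def _get_scorer(self):
        # Fix: cache the scorer instead of constructing a new one on every update()
        if self._scorer is None:
            self._scorer = ExpensiveScorer()
        return self._scorer
```

With the cache in place, repeated `update()` calls reuse one scorer instance, so the checkpoints load from disk only on the first batch.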
    self.model_name,
    self.llm_model_name,
)
dtype = torch.bfloat16 if self.device == "cuda" else torch.float32
Device check misses "cuda:N" device strings
Medium Severity
The dtype selection (self.device == "cuda") and the autocast guard in score both use exact string equality, so they fail to match indexed device strings like "cuda:0". When the framework's device resolution produces "cuda:0" (which set_to_best_available_device("cuda") does via _resolve_cuda_device), the model loads in float32 instead of bfloat16 and autocast is skipped. The nearby attn_impl line correctly uses startswith("cuda:") but these two checks do not.
Additional Locations (1)
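One way to make both checks tolerant of indexed device strings is a small shared predicate. A hedged sketch; `is_cuda_device` is a hypothetical helper, not code from the PR:

```python
def is_cuda_device(device: str) -> bool:
    # Matches "cuda" as well as indexed forms like "cuda:0" and "cuda:1",
    # without matching "cpu" or "mps".
    return device == "cuda" or device.startswith("cuda:")

# The dtype selection and the autocast guard could then both use the helper, e.g.:
# dtype = torch.bfloat16 if is_cuda_device(self.device) else torch.float32
```

Centralizing the check also keeps it consistent with the nearby `attn_impl` line, which already handles the `"cuda:"` prefix.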
input_pixels = self._processor(images=pil_images, return_tensors="pt").pixel_values.to(self.device)
captions = [text_prompt]
text_features = self._l2v.encode(captions, convert_to_tensor=True, device=self.device).to(self.device)
text_features = self._clip_model.get_text_features(text_features)
Text features computed outside torch.no_grad() context
Medium Severity
In _LLM2CLIPScorer.score, self._clip_model.get_text_features(text_features) runs outside torch.no_grad(), while the analogous get_image_features call is correctly wrapped. The model's parameters still have requires_grad=True (only train(mode=False) was called, not requires_grad_(False)), so PyTorch builds a full computation graph for the text projection forward pass that is never used for backprop. This wastes GPU memory and is inconsistent with the image feature computation on the very next line.
Summary
- Splits oneig_reasoning into its own stacked PR
- Adds the OneIGReasoningMetric implementation and lazy MetricRegistry loading

Test plan
- uv run pytest tests/evaluation/test_text_metrics.py -k oneig_reasoning