feat(e2e-tests): stacked e2e after split metrics #641

davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-4c-img-edit-score from
Conversation
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Reviewed by Cursor Bugbot for commit 7f24f9d.
```diff
     "peft>=0.18.0,<0.19.0",
     "trl<=0.21.0",
     "termcolor==2.3.0",
+    "realesrgan",
```
Package realesrgan moved from optional to core dependency
High Severity
realesrgan was moved from the [upscale] optional dependency group into core dependencies, and the [upscale] extra was deleted entirely. This forces every user to install realesrgan and its heavy transitive dependencies (basicsr, facexlib, gfpgan, etc.) even if they never use upscaling. This PR is about VLM e2e tests and has no reason to change this. Likely an accidental inclusion from a rebase or merge.
Additional Locations (1)
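If the move was indeed accidental, one plausible fix is to return `realesrgan` to an optional dependency group so the heavy transitive packages stay opt-in. This is a sketch of the pre-merge shape described in the finding, not the full pyproject.toml; the exact group contents are assumed:

```toml
# Hypothetical restoration: realesrgan lives in an [upscale] extra again,
# so `pip install pruna` stays lean and `pip install pruna[upscale]`
# pulls in the upscaling stack.
[project.optional-dependencies]
upscale = [
    "realesrgan",
]
```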
```diff
 [project]
 name = "pruna"
-version = "0.3.3"
+version = "0.3.2"
```
Version downgraded and Python 3.13 support dropped
High Severity
version was downgraded from "0.3.3" to "0.3.2" and requires-python was tightened from ">=3.10,<3.14" to ">=3.10,<3.13", dropping Python 3.13 support. The PR description says "pyproject.toml — Already updated in PR-2", suggesting these regressions were accidentally included during a rebase or merge conflict resolution.
Additional Locations (1)
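Assuming these fields were regressed during conflict resolution, restoring the values the finding describes would look like the following sketch (surrounding fields omitted):

```toml
[project]
name = "pruna"
version = "0.3.3"                  # reverts the accidental downgrade to 0.3.2
requires-python = ">=3.10,<3.14"   # restores the dropped Python 3.13 support
```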
```toml
evaluation = [
    "outlines>1.2.0,<2.0.0",
    "litellm>=1.0.0",
]
```
evaluation extra silently drops lmharness and rapidata
Medium Severity
The [evaluation] optional extra was redefined from ["pruna[rapidata]", "pruna[lmharness]"] to ["outlines>1.2.0,<2.0.0", "litellm>=1.0.0"]. Users running pip install pruna[evaluation] will no longer get lm-eval or rapidata. The [rapidata] extra was also completely removed. This is a silent backward-incompatible change to the package's public install interface.
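Per the finding, the extra previously aggregated other extras via self-referencing requirements (a standard PEP 508 pattern). A sketch of that prior shape, assuming `[rapidata]` and `[lmharness]` are restored as groups:

```toml
# Hypothetical pre-change layout: [evaluation] composes other extras,
# so `pip install pruna[evaluation]` still pulls in lm-eval and rapidata.
[project.optional-dependencies]
evaluation = [
    "pruna[rapidata]",
    "pruna[lmharness]",
]
```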
```diff
         {"img_size": 224},
     ),
-    "DrawBench": (setup_drawbench_dataset, "prompt_collate", {}),
+    "DrawBench": (setup_drawbench_dataset, "prompt_with_auxiliaries_collate", {}),
```
DrawBench/GenAIBench collate change alters return type
Medium Severity
DrawBench and GenAIBench collate functions changed from prompt_collate (returns (prompts, None)) to prompt_with_auxiliaries_collate (returns (prompts, list[dict])). Any existing code consuming these datasets and expecting gt=None (e.g., model inference handlers, metric update calls that check for None ground truth) will now receive a list of dicts, potentially causing unexpected behavior.
Additional Locations (1)
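The risk here is the change in the second element of the collate return value. The mock below illustrates the two shapes the review describes and how a consumer that branches on `gt is None` silently switches paths; these functions are simplified stand-ins, not the actual pruna implementations:

```python
# Illustrative mock of the two collate behaviours described in the review.
# `prompt_collate` and `prompt_with_auxiliaries_collate` here are simplified
# stand-ins; `update_metric` is a hypothetical consumer.

def prompt_collate(batch):
    # Old behaviour: ground truth is always None for prompt-only datasets.
    return [item["prompt"] for item in batch], None

def prompt_with_auxiliaries_collate(batch):
    # New behaviour: ground truth is a list of auxiliary dicts
    # (everything in each sample except the prompt).
    prompts = [item["prompt"] for item in batch]
    auxiliaries = [
        {k: v for k, v in item.items() if k != "prompt"} for item in batch
    ]
    return prompts, auxiliaries

def update_metric(prompts, gt):
    # Consumers that checked `gt is None` now take a different branch.
    if gt is None:
        return "prompt-only path"
    return "ground-truth path"

batch = [{"prompt": "a cat", "category": "animals"}]
print(update_metric(*prompt_collate(batch)))                   # prompt-only path
print(update_metric(*prompt_with_auxiliaries_collate(batch)))  # ground-truth path
```

Callers that relied on the old `(prompts, None)` contract need an explicit update, or the new collate needs to preserve `None` when no auxiliaries exist.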
- Add _vlm_batch_snapshot_helpers for test data generation
- Add end-to-end tests for metric interactions
- Add datamodule support for VLM evaluation
- Add task-level VLM metric integration
- Add VLM timing/profiling support
- Strip VLM task routing kwargs in TorchMetricWrapper
- Update docs with VLM evaluation guide
- Update data loaders for image/caption support
- Integration with evaluation agent for VLM metric selection
Force-pushed 7f24f9d to a45d5ac


Summary
Test plan

```shell
uv run pytest tests/evaluation/test_vlm_e2e.py tests/evaluation/test_task.py
```