Handle PIL UnidentifiedImageError exception when running cleanvision on local image folder dataset #263
Open
pavansai018 wants to merge 2 commits into cleanlab:main
Summary
This PR ensures that corrupted or unreadable images are filtered out at dataset construction time, so they never appear in dataset.index.
This prevents runtime crashes in downstream code paths such as find_issues() and visualize(), which assume that every index corresponds to a readable image.
Fixes #222
Motivation
Currently, dataset indices are created purely from discovered filepaths (or integer indices for torchvision datasets), without checking whether the underlying image data is actually readable.
As a result, corrupted or unreadable images can end up in dataset.index.
Since visualize() and issue managers materialize indices before accessing images, lazy handling in __getitem__ is insufficient.
The correct place to handle this is before the index is finalized.
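The idea of validating before the index is finalized can be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names is_readable_image and filter_readable are hypothetical, and the sketch assumes Pillow's Image.open() followed by Image.verify() is an acceptable readability check.

```python
from typing import Callable, Iterable, List


def is_readable_image(path: str) -> bool:
    """Hypothetical helper: True if Pillow can open and verify the file."""
    # Imported lazily so the filtering logic below can be used without Pillow.
    from PIL import Image, UnidentifiedImageError

    try:
        with Image.open(path) as img:
            # verify() looks for a broken file without decoding pixel data.
            img.verify()
        return True
    except (UnidentifiedImageError, OSError, SyntaxError):
        # UnidentifiedImageError is a subclass of OSError; it is listed
        # explicitly for readability. Some plugins raise SyntaxError
        # for malformed files.
        return False


def filter_readable(
    paths: Iterable[str],
    check: Callable[[str], bool] = is_readable_image,
) -> List[str]:
    """Hypothetical helper: keep only paths that pass `check`, in order."""
    return [p for p in paths if check(p)]
```

An index built from the output of filter_readable() would contain only files that passed the check, so code that materializes the index up front would not encounter an unreadable file later.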
What this PR changes
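File-based datasets (FSDataset)
Unreadable files are excluded from _filepaths and dataset.index.
TorchVision datasets (TorchDataset)
Filtering happens in _set_index(): a sample that cannot be read as a PIL.Image is excluded, so dataset.index contains only readable samples.
What this PR does NOT do
No changes to visualize() or issue managers.
No None propagation or sentinel values.
All downstream code continues to rely on the existing invariant: every index in dataset.index corresponds to a readable image.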
Performance considerations
Result
dataset.index is always consistent.
This makes dataset handling more robust while keeping the rest of the codebase unchanged.
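For integer-indexed datasets, the same filtering could look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: readable_indices and the SizedIndexable protocol are hypothetical names, and the sketch assumes that accessing dataset[i] decodes the image and raises on a corrupted sample.

```python
from typing import Any, List, Protocol


class SizedIndexable(Protocol):
    """Hypothetical minimal interface: len() plus integer indexing."""

    def __len__(self) -> int: ...

    def __getitem__(self, index: int) -> Any: ...


def readable_indices(dataset: SizedIndexable) -> List[int]:
    """Hypothetical helper: indices whose samples load without error."""
    keep: List[int] = []
    for i in range(len(dataset)):
        try:
            # Assumption: indexing triggers image decoding.
            dataset[i]
        except (OSError, SyntaxError):
            # PIL.UnidentifiedImageError subclasses OSError, so it is
            # caught here without importing Pillow.
            continue
        keep.append(i)
    return keep
```

Note that this sketch makes one full pass over the dataset at construction time, loading every sample once.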