Handle PIL UnidentifiedImageError exception when running cleanvision on local image folder dataset by pavansai018 · Pull Request #263 · cleanlab/cleanvision

pavansai018 · 2025-12-15T13:44:06Z

Summary

This PR ensures that corrupted / unreadable images are filtered out at dataset construction time, so they never appear in dataset.index.

This prevents runtime crashes in downstream code paths such as find_issues() and visualize(), which assume that every index corresponds to a readable image.

Fixes #222

Motivation

Currently, dataset indices are created purely from discovered filepaths (or integer indices for torchvision datasets), without checking whether the underlying image data is actually readable.

As a result:

Corrupted images can enter dataset.index
Errors surface later during visualization or issue detection
Failures occur far away from the root cause and are hard to recover from

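Because visualize() and the issue managers materialize the full set of indices before they access any image, lazy handling in __getitem__ is insufficient: by the time a read fails, the corrupted entry is already part of the index that downstream code iterates over.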

The correct place to handle this is before the index is finalized.
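
The argument above can be illustrated with a toy example. This is only a sketch: `LazyIndexDataset`, `ValidatedIndexDataset`, and `summarize` are hypothetical names invented for illustration and are not part of cleanvision. `None` stands in for a corrupted image.

```python
class LazyIndexDataset:
    """Hypothetical dataset whose index is built without reading any sample."""

    def __init__(self, samples):
        self._samples = samples  # None marks a corrupted entry
        self.index = list(range(len(samples)))

    def __getitem__(self, i):
        sample = self._samples[i]
        if sample is None:
            # The failure only surfaces here, long after the index was built.
            raise OSError(f"cannot read sample {i}")
        return sample


class ValidatedIndexDataset(LazyIndexDataset):
    """Same data, but unreadable entries are dropped before the index is final."""

    def __init__(self, samples):
        super().__init__(samples)
        self.index = [i for i in self.index if self._is_readable(i)]

    def _is_readable(self, i):
        try:
            self[i]
            return True
        except OSError:
            return False


def summarize(dataset):
    # Stand-in for a downstream consumer: it takes the index as given
    # and assumes every entry can be read.
    return [len(dataset[i]) for i in dataset.index]
```

With `samples = ["ab", None, "c"]`, calling `summarize` on the lazy variant raises `OSError` partway through, while the validated variant never exposes index 1 to the consumer.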

What this PR changes

File-based datasets (`FSDataset`)

During filepath discovery, each image is opened once to check integrity
Corrupted images are silently skipped
Only valid image paths are included in _filepaths and dataset.index
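
A minimal sketch of what such a discovery-time check could look like, assuming Pillow is installed. The helper names `is_readable_image` and `discover_valid_filepaths` and the `IMAGE_SUFFIXES` set are assumptions made for illustration; the actual code in this PR may be structured differently.

```python
from pathlib import Path
from typing import List

from PIL import Image, UnidentifiedImageError

# Assumed set of extensions, for illustration only.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def is_readable_image(path: Path) -> bool:
    """Open the file once and ask Pillow to verify it; never raise."""
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without decoding all pixel data
        return True
    except (UnidentifiedImageError, OSError, SyntaxError, ValueError):
        # UnidentifiedImageError is raised by Image.open for non-image data;
        # the other types cover errors some format plugins raise in verify().
        return False


def discover_valid_filepaths(root: str) -> List[str]:
    """Return sorted paths under root that have an image suffix and pass the check."""
    candidates = sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [str(p) for p in candidates if is_readable_image(p)]
```

Because the check runs inside discovery, a path that fails it is simply never added to the returned list, so nothing downstream needs to know it existed.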

TorchVision datasets (`TorchDataset`)

Dataset indices are validated once during _set_index()
Any sample whose image cannot be accessed or is not a valid PIL.Image is excluded
dataset.index contains only readable samples
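
The same idea for index-based datasets can be sketched as follows. `readable_indices` and its `image_position` parameter are hypothetical; the sketch assumes each sample is a tuple such as `(image, label)`, which is the common torchvision convention, and it works with any object that supports `len()` and integer indexing.

```python
from typing import Any, List

from PIL import Image


def readable_indices(dataset: Any, image_position: int = 0) -> List[int]:
    """Return the indices whose sample can be fetched and holds a PIL image."""
    valid: List[int] = []
    for i in range(len(dataset)):
        try:
            image = dataset[i][image_position]
        except Exception:
            # Any failure while fetching the sample excludes that index.
            continue
        if isinstance(image, Image.Image):
            valid.append(i)
    return valid
```

Catching a broad `Exception` here is a deliberate choice for the sketch, since a wrapped dataset may raise arbitrary error types when decoding fails; a real implementation might narrow it.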

What this PR does NOT do

❌ No changes to visualize() or issue managers
❌ No None propagation or sentinel values
❌ No behavior changes for valid datasets

All downstream code continues to rely on the existing invariant:

Every index in dataset.index maps to a valid image

Performance considerations

Each image/sample is checked once at dataset construction time
This is unavoidable if dataset membership must exclude corrupted entries
For valid images, there is no additional overhead during processing or visualization

Result

dataset.index is always consistent
Visualization never crashes due to corrupted images
Errors are handled at the correct architectural boundary

This makes dataset handling more robust while keeping the rest of the codebase unchanged.

CLAassistant · 2025-12-15T13:44:18Z

All committers have signed the CLA.

Handle PIL UnidentifiedImageError exception when running cleanvision on local image folder dataset

fed6adb

pavansai018 force-pushed the log-corrupt-image branch from bde7810 to fed6adb on December 16, 2025 03:32

Merge branch 'main' into log-corrupt-image

19442cc

pavansai018 wants to merge 2 commits into cleanlab:main from pavansai018:log-corrupt-image
