This repository contains the code and methodology used to curate a visually enhanced version of the classic MNIST dataset.
The curated dataset adds an IDK (“I Don’t Know”) class for ambiguous, noisy, or hard-to-classify digits, enabling research on robust classification and uncertainty handling.
👉 Curated dataset on Hugging Face:
https://huggingface.co/datasets/YOUR_DATASET_NAME
👉 This repository: Code for data curation, embedding visualization, IDK generation, and classifier training.
The curated MNIST dataset:
- keeps the original 10 digit classes (0–9)
- adds an 11th class: IDK
- relabels visually ambiguous, distorted, or extremely noisy samples as IDK
- is designed for:
  - robust classification experiments
  - out-of-distribution detection
  - dataset curation workflows
  - training models that can abstain (“I don’t know”)
Standard MNIST contains many ambiguous digits:
- spaghetti-like “9”
- fuzzy or low-contrast strokes
- digits written strangely or partially cropped
- digits consistently misclassified by vanilla LeNet
These images reduce model reliability and complicate evaluation.
The goal of this project is to:
- Identify such problematic samples
- Relabel them as IDK
- Retrain a classifier on the improved dataset
- Compare the baseline vs IDK-aware model
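The relabeling step in the list above can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: the class id `10` for IDK, the function name `relabel_as_idk`, and the assumption that flagged samples arrive as a collection of integer indices are all hypothetical.

```python
IDK_LABEL = 10  # assumption: IDK is encoded as the 11th class id, after digits 0-9


def relabel_as_idk(labels, flagged_indices, idk_label=IDK_LABEL):
    """Return a copy of `labels` with the flagged positions set to the IDK class."""
    flagged = set(flagged_indices)
    out_of_range = [i for i in flagged if not 0 <= i < len(labels)]
    if out_of_range:
        raise IndexError(f"flagged indices out of range: {sorted(out_of_range)}")
    # Build a new list so the original labels stay available for comparison.
    return [idk_label if i in flagged else y for i, y in enumerate(labels)]


original = [5, 0, 4, 1, 9]
curated = relabel_as_idk(original, flagged_indices={2, 4})
print(curated)  # [5, 0, 10, 1, 10]
```

Keeping the original labels untouched makes it straightforward to train the baseline and the IDK-aware model from the same source data.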
The curation process uses FiftyOne, UMAP, PCA, and Brain metrics.
A baseline classifier is used to extract embeddings and detect misclassified samples.
The embeddings are visualized using:
- PCA
- UMAP
- FiftyOne’s interactive embedding view
- hardness: how often the model hesitates
- mistakenness: samples frequently misclassified
- uniqueness: outliers in embedding space
- representativeness: typicality within the dataset
This reveals clusters of ambiguous digits and strong outliers.
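The visualization and metric steps described above map onto functions in the FiftyOne Brain package. The sketch below is one way they might be wired together, not the repository's actual pipeline: the function name `compute_curation_signals`, the field names `predictions` and `ground_truth`, and the brain keys are assumptions. It also assumes the prediction field stores logits, which the hardness and mistakenness computations need, and that `umap-learn` is installed for the UMAP layout.

```python
def compute_curation_signals(dataset, embeddings,
                             pred_field="predictions",
                             label_field="ground_truth"):
    """Add 2D layouts and Brain scores to a FiftyOne dataset (illustrative sketch).

    `dataset` is a FiftyOne dataset; `embeddings` is an array with one row per
    sample. The field names and brain keys used here are hypothetical.
    """
    # Imported lazily so this module can be loaded without FiftyOne installed.
    import fiftyone.brain as fob

    # 2D layouts of the embedding space, browsable in the App's embeddings panel.
    fob.compute_visualization(dataset, embeddings=embeddings,
                              method="pca", num_dims=2, brain_key="pca_viz")
    fob.compute_visualization(dataset, embeddings=embeddings,
                              method="umap", num_dims=2, brain_key="umap_viz")

    # Per-sample scores; hardness and mistakenness read logits from `pred_field`.
    fob.compute_hardness(dataset, pred_field)
    fob.compute_mistakenness(dataset, pred_field, label_field=label_field)
    fob.compute_uniqueness(dataset, embeddings=embeddings)
    fob.compute_representativeness(dataset, embeddings=embeddings)
    return dataset
```

After running it, the dataset can be opened with `fiftyone.launch_app(dataset)` and sorted by any of the resulting score fields.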
Example UMAP visualization:
Suspicious samples detected by the above metrics were visually inspected:
- ambiguous shapes
- messy handwriting
- conflicting class evidence
- strong outliers
- mislabeled samples
These were relabeled as IDK.
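Before the manual inspection, the metric scores can be turned into a shortlist of candidates. The following is a minimal sketch under stated assumptions: samples are plain dictionaries with an `id` and a score field, and the helper name `build_review_queue`, the choice of `mistakenness` as the ranking metric, and the `0.5` threshold are all hypothetical rather than values used in this project.

```python
def build_review_queue(samples, metric="mistakenness", threshold=0.5, limit=None):
    """Return sample ids whose `metric` score is at least `threshold`, highest first."""
    # Samples missing the metric are treated as score 0.0 and therefore skipped
    # for any positive threshold.
    candidates = [s for s in samples if s.get(metric, 0.0) >= threshold]
    candidates.sort(key=lambda s: s.get(metric, 0.0), reverse=True)
    ids = [s["id"] for s in candidates]
    return ids if limit is None else ids[:limit]


samples = [
    {"id": "a", "mistakenness": 0.91},
    {"id": "b", "mistakenness": 0.12},
    {"id": "c", "mistakenness": 0.67},
]
queue = build_review_queue(samples)
print(queue)  # ['a', 'c']
```

The queue only orders the work; the decision to relabel a sample as IDK still comes from visual inspection.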
If you use this curated dataset or code, please cite the original MNIST paper:
@article{lecun1998gradient,
title={Gradient-based learning applied to document recognition},
author={LeCun, Yann and Bottou, L{\'e}on and Bengio, Yoshua and Haffner, Patrick},
journal={Proceedings of the IEEE},
volume={86},
number={11},
pages={2278--2324},
year={1998},
publisher={IEEE}
}
