Benchmarks the speed of reading JPEG images and converting them to RGB numpy arrays across popular Python libraries, targeting machine learning training pipelines. Results are measured across multiple CPU architectures (Intel Xeon, AMD EPYC, ARM Neoverse, Apple M-series) using the ImageNet validation set.
The plots and tables below are generated from `output/<platform>/*.json`. To refresh after a new run:

```shell
imread-benchmark plot --input output --output docs/assets/benchmarks
imread-benchmark render-readme
```

The plot labels show img/s and the % of the fastest decoder on that CPU, so darker cells are the winners for that platform.
Pure decode speed with one thread, bytes pre-loaded to memory. Bold = best per platform.
| Library | AMD EPYC 9B14 | AMD EPYC 9B45 | Intel Xeon Platinum 8581C | Neoverse-N1 | Neoverse-V2 |
|---|---|---|---|---|---|
| simplejpeg | **690** | 857 | **735** | 456 | **662** |
| turbojpeg | 640 | 818 | 708 | 426 | 613 |
| jpeg4py | 636 | 760 | 699 | 423 | 611 |
| kornia-rs | 642 | 761 | 664 | 391 | 629 |
| opencv | 664 | 841 | 721 | 445 | 645 |
| imagecodecs | 677 | 775 | 723 | **457** | 661 |
| pyvips | 420 | 586 | 462 | 261 | 413 |
| pillow | 537 | 726 | 577 | 360 | 551 |
| skimage | 475 | 661 | 525 | 326 | 499 |
| imageio | 496 | 599 | 524 | 335 | 506 |
| torchvision | 621 | **864** | 712 | 440 | 643 |
| tensorflow | 596 | 836 | 689 | 268 | 391 |
Best `images_per_second` across `num_workers` ∈ {0, 2, 4, 8} for each library × platform, using a PyTorch `DataLoader` with `batch_size=32`. Cell format: img/s @ Nw. Bold = best per platform.
| Library | AMD EPYC 9B14 | AMD EPYC 9B45 | Intel Xeon Platinum 8581C | Neoverse-N1 | Neoverse-V2 |
|---|---|---|---|---|---|
| simplejpeg | 1,521 @ 4w | 2,739 @ 8w | **1,754 @ 8w** | **1,557 @ 8w** | 2,421 @ 8w |
| turbojpeg | 1,535 @ 4w | 2,800 @ 8w | 1,710 @ 8w | 1,347 @ 4w | 2,389 @ 8w |
| jpeg4py | 1,443 @ 4w | 2,453 @ 8w | 1,651 @ 8w | 1,411 @ 8w | 2,312 @ 8w |
| kornia-rs | 1,327 @ 8w | 2,394 @ 8w | 1,422 @ 8w | 1,260 @ 8w | 1,951 @ 8w |
| opencv | 1,457 @ 4w | 2,814 @ 8w | 1,707 @ 8w | 1,419 @ 8w | 2,414 @ 8w |
| imagecodecs | 1,543 @ 4w | 2,476 @ 8w | 1,677 @ 8w | 1,443 @ 8w | 2,242 @ 8w |
| pillow | 1,283 @ 4w | 2,465 @ 8w | 1,565 @ 8w | 1,387 @ 8w | 2,350 @ 8w |
| skimage | 1,238 @ 4w | 2,536 @ 8w | 1,615 @ 8w | 1,388 @ 8w | 2,315 @ 8w |
| imageio | 1,273 @ 4w | 2,324 @ 8w | 1,643 @ 8w | 1,466 @ 8w | **2,561 @ 8w** |
| torchvision | **1,596 @ 8w** | **2,920 @ 8w** | 1,612 @ 4w | 1,504 @ 8w | 2,557 @ 8w |
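The worker sweep behind this table can be approximated with a small sketch. The real benchmark uses PyTorch `DataLoader` worker processes; here threads stand in for workers so the example stays dependency-free, and `decode` is a placeholder for any of the decoders above:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sweep_workers(blobs, decode, workers=(0, 2, 4, 8)):
    """Return {num_workers: images_per_second}; 0 means the main thread only."""
    results = {}
    for w in workers:
        start = time.perf_counter()
        if w == 0:
            for blob in blobs:
                decode(blob)
        else:
            # Threads approximate DataLoader worker processes for this sketch.
            with ThreadPoolExecutor(max_workers=w) as pool:
                list(pool.map(decode, blobs))
        results[w] = len(blobs) / (time.perf_counter() - start)
    return results
```

`max(results, key=results.get)` then picks the "@ Nw" entry reported per cell.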
5 platforms · 50,000 images · 5 runs each · latest run 2026-04-22
All decoders output (H, W, 3) uint8 RGB numpy arrays for a fair comparison. Libraries that default to other formats (OpenCV → BGR, torchvision → CHW tensor, TensorFlow → EagerTensor) include a conversion step. Note that in real ML pipelines the conversion is often unnecessary.
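For illustration, the two most common conversions look like this in NumPy (a sketch of the idea, not the benchmark's actual code):

```python
import numpy as np

# OpenCV decodes to BGR; reversing the channel axis gives RGB.
bgr = np.zeros((4, 4, 3), dtype=np.uint8)
bgr[..., 0] = 255          # fill the blue channel
rgb = bgr[..., ::-1]       # BGR -> RGB

# torchvision yields CHW tensors; a CHW array is transposed back to HWC.
chw = np.ones((3, 4, 4), dtype=np.float32)  # stand-in for a CHW tensor
hwc = np.transpose(chw, (1, 2, 0))
```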
Memory mode (default): images are pre-loaded as bytes before the timed loop. This measures pure decode throughput with no disk I/O.
Disk mode: each decode call reads the file from disk. Includes I/O latency.
ImageNet validation set — 50,000 JPEG images, ~500×400px.
```shell
# Download
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir -p imagenet/val
tar -xf ILSVRC2012_img_val.tar -C imagenet/val
```

```shell
brew install jpeg-turbo  # required by PyTurboJPEG (pure-python ctypes binding)
```

pyvips ships its own bundled libvips via the `pyvips-binary` PyPI wheel, so no `brew install vips` is needed. simplejpeg wheels bundle libjpeg-turbo. On Linux you'll still need `apt install libjpeg-turbo8-dev libturbojpeg0` (see `gcp/vm_startup.sh`), since jpeg4py is built from sdist.
```shell
# Install uv if needed
pip install uv

# Install the orchestrator (control-plane) into a venv.
# Per-library worker venvs (mainstream / tensorflow) are created lazily on
# first run, with the right libjpeg-turbo / libvips deps.
uv venv && source .venv/bin/activate
uv pip install -e .
```

```shell
# What would run on this machine?
imread-benchmark list-libs

# Single + DataLoader for every supported decoder, default 50k images
imread-benchmark run --data-dir /path/to/imagenet/val

# Faster smoke run
imread-benchmark run --data-dir /path/to/imagenet/val \
    --num-images 2000 --num-runs 5 --dataloader-runs 2 \
    --workers 0,2

# Just one library, single-thread benchmark only
imread-benchmark run --data-dir /path/to/imagenet/val \
    --libs opencv --mode single

# Generate README plots from output/ JSONs
imread-benchmark plot --input output --output docs/assets/benchmarks
```

The CLI sets up `venvs/<group>/` for each dependency group it needs. Subsequent runs reuse those venvs, so only the first invocation pays the install cost.
Spin up a benchmark VM on GCP, run everything against ImageNet from a GCS bucket, and have it self-delete when done:
```shell
./gcp/run.sh \
    --imagenet-bucket gs://my-bucket/imagenet/val \
    --results-bucket gs://my-bucket/imread-results \
    --no-wait
```

Built venvs are cached in GCS (keyed by `sha256(uv.lock)`), so reruns on the same machine type skip the ~25-minute install. Use `--force-rebuild` to re-resolve PyPI without editing `uv.lock`. Full details, machine-type matrix, cost, and cache semantics: `docs/gcp_benchmarks.md`.
```
output/
└── darwin_Apple-M4-Max/
    ├── opencv_results.json
    ├── pillow_results.json
    ├── opencv_dataloader_results.json
    └── ...
```
- simplejpeg — CFFI binding; zero-copy decode from bytes
- turbojpeg (PyTurboJPEG) — Python binding for libjpeg-turbo
- jpeg4py — direct libjpeg-turbo binding (Linux only)
- kornia-rs — Rust implementation using libjpeg-turbo
- OpenCV (opencv-python-headless)
- imagecodecs — uses libjpeg-turbo 3.x; prebuilt ARM64 wheels
- pyvips — libvips bindings (bundled in wheels). Single-thread only; the libvips threadpool deadlocks under fork-based PyTorch DataLoader, so dataloader benchmarks are skipped on every platform.
- Pillow
- scikit-image
- imageio
Note: Pillow-SIMD was previously included but dropped 2026-04. Upstream is abandoned (last release 2023-05), there are no Linux wheels, and its historical SIMD speedup is now matched by jpeg4py/simplejpeg/kornia-rs. Full rationale in `docs/gcp_benchmarks.md`.
- torchvision
- tensorflow
- All benchmarks run single-threaded unless using the DataLoader benchmark
- Memory mode is the recommended baseline — it isolates decode speed from storage
- Results based on ImageNet JPEG images (~500×400px)
- Use `simplejpeg`, `turbojpeg`, or `kornia-rs` for maximum single-thread decode speed
- Use the DataLoader benchmark to find the best `num_workers` for your CPU
- `kornia-rs` and `opencv` offer the most consistent cross-platform performance
- `opencv` remains the best choice when you need more than just JPEG decoding
```shell
# Run tests
uv run pytest tests/ -v

# Run linters
uv run pre-commit run --all-files
```

See CONTRIBUTING.md for how to add a new decoder.
If you found this work useful, please cite:
```bibtex
@misc{iglovikov2025speed,
  title={Need for Speed: A Comprehensive Benchmark of JPEG Decoders in Python},
  author={Vladimir Iglovikov},
  year={2025},
  eprint={2501.13131},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  doi={10.48550/arXiv.2501.13131}
}
```
