On-Device Image Captioning

A lightweight on-device image captioning system based on MobileVLM v2, with PyTorch, ONNX, and Core ML (iOS) deployment.

Overview

This project implements an image captioning pipeline designed for deployment in resource-constrained environments.

Three models are supported: BLIP, InstructBLIP, and MobileVLM v2.
MobileVLM v2 can be deployed with PyTorch, ONNX, or Core ML.

Features

Lightweight on-device image captioning
MobileVLM v2 support
PyTorch / ONNX / Core ML inference
iOS deployment support
Edge-device optimization

Model Weights

PyTorch

The following pretrained models from Hugging Face are used in this project:

BLIP: "Salesforce/blip-image-captioning-base"
InstructBLIP: "Salesforce/instructblip-flan-t5-xl"
MobileVLM: "mtgv/MobileVLM_V2-1.7B"
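
As an illustration of how one of these checkpoints can be used, the sketch below captions an image with BLIP base through the Hugging Face transformers library. It is a minimal example, not this project's own loading code: the function name caption_with_blip and the 40-token limit are assumptions made here, and it requires transformers, torch, and Pillow to be installed. InstructBLIP and MobileVLM need their own loading code and are not covered.

```python
# Hugging Face model IDs listed in this README.
MODEL_IDS = {
    "BLIP": "Salesforce/blip-image-captioning-base",
    "InstructBLIP": "Salesforce/instructblip-flan-t5-xl",
    "MobileVLM": "mtgv/MobileVLM_V2-1.7B",
}


def caption_with_blip(image_path: str, max_new_tokens: int = 40) -> str:
    """Caption one image with BLIP base (weights are downloaded on first use).

    Hypothetical helper for illustration; not part of this repository.
    """
    # Imported lazily so the heavy dependencies are only needed when called.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    model_id = MODEL_IDS["BLIP"]
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)

    # BLIP expects RGB input; the processor handles resizing and normalization.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling caption_with_blip("sample.jpg") returns the caption as a string.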

ONNX

The ONNX model weights are large (~6 GB), so they are not included in this repository.

Generate them locally with the provided export script:

python src/mobilevlm/export/export_onnx/export_*.py

Core ML

The Core ML model weights are large (~6 GB), so they are not included in this repository.

Generate them locally with the provided export script:

python src/mobilevlm/export/export_coreml/export_*.py
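
Note that a shell expands export_*.py to every matching file, and python then executes only the first match and passes the rest as arguments. If an export directory holds several scripts, one way to run each of them is sketched below. The helper names are hypothetical, and the sketch assumes the scripts take no command-line arguments and can run in name order, neither of which is stated here.

```python
import subprocess
import sys
from pathlib import Path


def find_export_scripts(export_dir: str) -> list[Path]:
    """Return the export_*.py files in export_dir, sorted by name."""
    return sorted(Path(export_dir).glob("export_*.py"))


def run_export_scripts(export_dir: str) -> None:
    """Run each export script in its own Python process, stopping on failure."""
    for script in find_export_scripts(export_dir):
        print(f"Running {script}")
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, str(script)], check=True)
```

For example, run_export_scripts("src/mobilevlm/export/export_onnx") covers the ONNX scripts, and the same call with the export_coreml directory covers Core ML.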

Running MobileVLM

Instructions for running MobileVLM are in document/how_to_run.md:

https://github.com/kc-ml2/captioning_edgedevice/blob/main/document/how_to_run.md

Result

Input

sample.jpg

Question:
"What objects are visible in the scene?"

Generated Caption

"In the image, there is a living room with a fireplace, a television, a table, chairs, and a woman standing in the kitchen."

Performance and Latency

Load Latency
Time for the model server to load the model once at startup and run an initial warm-up pass.

VRAM Usage
GPU memory usage on an NVIDIA TITAN V, measured after the model is fully loaded and warmed up.

Inference Latency
Captioning latency per image, measured over 500 randomly sampled images from COCO val2017.
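
A per-image latency measurement of this kind can be sketched as follows. This is an illustrative harness, not the script used to produce the numbers in this README: measure_latency, its parameters, and the choice of summary statistics are assumptions, and caption_fn stands in for whichever model call is being timed.

```python
import random
import statistics
import time
from typing import Callable, Sequence


def measure_latency(
    caption_fn: Callable[[str], str],
    image_paths: Sequence[str],
    num_samples: int = 500,
    seed: int = 0,
) -> dict:
    """Time caption_fn on a random sample of images and summarize the results."""
    if not image_paths:
        raise ValueError("image_paths must not be empty")

    # Fixed seed so the same images are sampled on every run.
    rng = random.Random(seed)
    sample = rng.sample(list(image_paths), min(num_samples, len(image_paths)))

    # Warm-up call, excluded from the timings.
    caption_fn(sample[0])

    latencies = []
    for path in sample:
        start = time.perf_counter()
        caption_fn(path)
        latencies.append(time.perf_counter() - start)

    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }
```

When timing GPU inference with PyTorch, caption_fn should wait for the device to finish (for example by calling torch.cuda.synchronize() before returning), since CUDA kernels are launched asynchronously.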

Model Comparison

Model	Precision	Load Latency	VRAM Usage	Inference Latency
BLIP base	FP16	5 s	0.9 GiB	0.6 s
InstructBLIP	FP16	12 s	10.7 GiB	2.0 s
InstructBLIP	Hybrid (INT4 LLM)	15 s	6.9 GiB	2.7 s
InstructBLIP	INT4	20 s	5.0 GiB	2.7 s
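
To make the quantization trade-off for InstructBLIP easier to read, the sketch below computes each variant's change relative to FP16. The numbers are copied from the table above; the function names are introduced here for illustration only.

```python
# Rows copied from the table: (load latency s, VRAM GiB, inference latency s).
INSTRUCTBLIP = {
    "FP16": (12, 10.7, 2.0),
    "Hybrid (INT4 LLM)": (15, 6.9, 2.7),
    "INT4": (20, 5.0, 2.7),
}


def relative_change_pct(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline (negative = reduction)."""
    return (value - baseline) / baseline * 100


def compare_to_fp16(rows: dict) -> dict:
    """Return per-variant percentage changes against the FP16 row."""
    base_load, base_vram, base_infer = rows["FP16"]
    result = {}
    for name, (load, vram, infer) in rows.items():
        if name == "FP16":
            continue
        result[name] = {
            "load_pct": relative_change_pct(load, base_load),
            "vram_pct": relative_change_pct(vram, base_vram),
            "inference_pct": relative_change_pct(infer, base_infer),
        }
    return result


for name, change in compare_to_fp16(INSTRUCTBLIP).items():
    print(
        f"{name}: VRAM {change['vram_pct']:+.1f}%, "
        f"inference {change['inference_pct']:+.1f}%, "
        f"load {change['load_pct']:+.1f}%"
    )
```

On these figures, full INT4 cuts VRAM by about 53% relative to FP16, while inference latency rises by 35% for both quantized variants.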

MobileVLM Runtime Breakdown

Latency breakdown for inference on a single image.

The current Xcode/Core ML implementation is not fully optimized.

Runtime	Preprocessing	Vision Encoder	Projector	LLM (40 tokens)	Total
GPU	-	-	-	-	1.7 s
CPU (Python)	0.04 s	0.52 s	0.01 s	6.33 s	6.90 s
CPU (ONNX)	0.03 s	0.88 s	0.02 s	3.77 s	4.70 s
iOS (Core ML)	0.12 s	1.27 s	0.02 s	6.10 s	7.51 s
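
The breakdown can be summarized per runtime as the share of time spent in the LLM and an approximate cost per token. The sketch below uses the figures from the table above and assumes that the LLM column covers 40 generated tokens, so the per-token value also absorbs any prompt-processing time; the names are introduced here for illustration.

```python
STAGES = ("preprocessing", "vision_encoder", "projector", "llm")

# Stage times in seconds, copied from the table (the GPU row has no breakdown).
BREAKDOWN = {
    "CPU (Python)": (0.04, 0.52, 0.01, 6.33),
    "CPU (ONNX)": (0.03, 0.88, 0.02, 3.77),
    "iOS (Core ML)": (0.12, 1.27, 0.02, 6.10),
}

LLM_TOKENS = 40  # assumed number of generated tokens behind the LLM column


def summarize(stage_times: tuple, llm_tokens: int = LLM_TOKENS) -> dict:
    """Return total time, LLM share of the total, and LLM seconds per token."""
    total = sum(stage_times)
    llm = stage_times[STAGES.index("llm")]
    return {
        "total_s": total,
        "llm_share_pct": 100 * llm / total,
        "llm_s_per_token": llm / llm_tokens,
    }


for runtime, times in BREAKDOWN.items():
    s = summarize(times)
    print(
        f"{runtime}: total {s['total_s']:.2f} s, "
        f"LLM share {s['llm_share_pct']:.1f}%, "
        f"{s['llm_s_per_token']:.3f} s/token"
    )
```

The LLM dominates in every case: about 92% of the total on CPU with PyTorch, and about 80% with ONNX and Core ML. Per token, this works out to roughly 0.16 s on CPU with PyTorch and roughly 0.09 s with ONNX.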

Acknowledgement

This project is partially based on the official MobileVLM repository: https://github.com/Meituan-AutoML/MobileVLM

We adapted and modified the original implementation for:

PyTorch runtime
ONNX deployment
Core ML deployment
On-device inference optimization

License

This project includes code derived from the MobileVLM repository, which is licensed under the Apache License 2.0.

The official PyTorch-based MobileVLM was modified and exported to ONNX and Core ML for on-device inference.
