Lightweight on-device image captioning system based on MobileVLM, supporting PyTorch, ONNX, and iOS deployment.
This project implements an image captioning pipeline designed for deployment in resource-constrained environments.
- Supports multiple models: BLIP, InstructBLIP, and MobileVLM v2
- Supports PyTorch, ONNX, and Core ML deployment for MobileVLM v2
- Lightweight on-device image captioning
- MobileVLM v2 support
- PyTorch / ONNX / Core ML inference
- iOS deployment support
- Edge-device optimization
The following pretrained models from Hugging Face are used in this project:
- BLIP: "Salesforce/blip-image-captioning-base"
- InstructBLIP: "Salesforce/instructblip-flan-t5-xl"
- MobileVLM: "mtgv/MobileVLM_V2-1.7B"
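As a rough illustration of how one of the checkpoints listed above can be pulled from Hugging Face, the sketch below captions a single image with BLIP base using the `transformers` library. It is not this project's loader: the `MODEL_IDS` dictionary and the `caption_with_blip` helper are hypothetical names introduced here, and MobileVLM is not covered because it relies on model code from the MobileVLM repository rather than a stock `transformers` class.

```python
# Model IDs copied from the list above.
MODEL_IDS = {
    "blip": "Salesforce/blip-image-captioning-base",
    "instructblip": "Salesforce/instructblip-flan-t5-xl",
    "mobilevlm": "mtgv/MobileVLM_V2-1.7B",
}


def caption_with_blip(image_path: str, max_new_tokens: int = 40) -> str:
    """Caption one image with BLIP base (hypothetical helper, not project code).

    Weights are downloaded from Hugging Face on first use.
    """
    # Imported lazily so this module can be imported without these packages.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_IDS["blip"])
    model = BlipForConditionalGeneration.from_pretrained(MODEL_IDS["blip"])
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling `caption_with_blip("sample.jpg")` would return a caption string, assuming `transformers`, `torch`, and `Pillow` are installed and the image file exists.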
Due to their large size (~6GB), ONNX model weights are not included in this repository.
Instead, you can generate them locally using the provided export script:
`python src/mobilevlm/export/export_onnx/export_*.py`
Due to their large size (~6GB), Core ML model weights are not included in this repository.
Instead, you can generate them locally using the provided export script:
`python src/mobilevlm/export/export_coreml/export_*.py`
For instructions on running the project, see https://github.com/kc-ml2/captioning_edgedevice/blob/main/document/how_to_run.md
sample.jpg
Question:
"What objects are visible in the scene?"
Answer:
"In the image, there is a living room with a fireplace, a television, a table, chairs, and a woman standing in the kitchen."
Load Latency
The model server loads the model once at startup and performs an initial warm-up pass.
VRAM Usage
GPU (NVIDIA TITAN V) memory usage is measured after the model is fully loaded and warmed up.
Inference Latency
Image captioning latency per image is measured over 500 randomly sampled images from COCO val2017.
| Model | Precision | Load Latency | VRAM Usage | Inference Latency |
|---|---|---|---|---|
| BLIP base | FP16 | 5 s | 0.9 GiB | 0.6 s |
| InstructBLIP | FP16 | 12 s | 10.7 GiB | 2.0 s |
| InstructBLIP | Hybrid (INT4 LLM) | 15 s | 6.9 GiB | 2.7 s |
| InstructBLIP | INT4 | 20 s | 5.0 GiB | 2.7 s |
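The measurement definitions above (load once, warm up, then time inference over 500 randomly sampled images) can be sketched as a small timing harness. This is a minimal sketch under stated assumptions, not the project's benchmark script: `timed_load` and `mean_latency` are hypothetical helpers, load latency is assumed to include the warm-up pass, and the per-image figure is assumed to be an arithmetic mean. For GPU models, the inference callable would also need to synchronize the device before returning so that asynchronous kernels are included in the timing.

```python
import random
import statistics
import time
from typing import Callable, Sequence, Tuple, TypeVar

M = TypeVar("M")
T = TypeVar("T")


def timed_load(load_fn: Callable[[], M],
               warmup_fn: Callable[[M], object]) -> Tuple[M, float]:
    """Load a model once, run one warm-up pass, and return (model, seconds).

    Assumption: the reported load latency covers both loading and warm-up.
    """
    start = time.perf_counter()
    model = load_fn()
    warmup_fn(model)
    return model, time.perf_counter() - start


def mean_latency(infer_fn: Callable[[T], object],
                 items: Sequence[T],
                 sample_size: int = 500,
                 seed: int = 0) -> float:
    """Mean per-item latency in seconds over a random sample of `items`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    pool = list(items)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    latencies = []
    for item in sample:
        start = time.perf_counter()
        infer_fn(item)  # for GPU models, synchronize inside infer_fn
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```

In practice, `items` would be the list of COCO val2017 image paths and `infer_fn` a function that captions one image with the already loaded model.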
Latency breakdown for a single image inference.
The current Xcode/Core ML implementation is not fully optimized.
| Runtime | Preprocessing | Vision Encoder | Projector | LLM (40 tokens) | Total |
|---|---|---|---|---|---|
| GPU | - | - | - | - | 1.7 s |
| CPU (Python) | 0.04 s | 0.52 s | 0.01 s | 6.33 s | 6.90 s |
| CPU (ONNX) | 0.03 s | 0.88 s | 0.02 s | 3.77 s | 4.70 s |
| iOS (Core ML) | 0.12 s | 1.27 s | 0.02 s | 6.10 s | 7.51 s |
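To make the breakdown easier to read, the short script below recomputes each total from its stages and derives the LLM's share of end-to-end latency and an approximate per-token cost. The stage values are copied from the table above. The dictionary and function names are illustrative, and dividing the LLM time by 40 assumes that "40" refers to generated tokens, so the per-token figure also folds in any prompt processing time.

```python
# Per-stage latencies in seconds, copied from the table above.
BREAKDOWN = {
    "CPU (Python)": {"preprocessing": 0.04, "vision_encoder": 0.52,
                     "projector": 0.01, "llm": 6.33},
    "CPU (ONNX)": {"preprocessing": 0.03, "vision_encoder": 0.88,
                   "projector": 0.02, "llm": 3.77},
    "iOS (Core ML)": {"preprocessing": 0.12, "vision_encoder": 1.27,
                      "projector": 0.02, "llm": 6.10},
}
# Totals as reported in the table, used only as a consistency check.
REPORTED_TOTAL = {"CPU (Python)": 6.90, "CPU (ONNX)": 4.70, "iOS (Core ML)": 7.51}
GENERATED_TOKENS = 40  # assumption: "40" counts generated tokens


def summarize(stages: dict) -> dict:
    """Return the summed total, the LLM share of it, and seconds per token."""
    total = sum(stages.values())
    return {
        "total": round(total, 2),
        "llm_share": stages["llm"] / total,
        "sec_per_token": stages["llm"] / GENERATED_TOKENS,
    }


for runtime, stages in BREAKDOWN.items():
    s = summarize(stages)
    print(f"{runtime}: total={s['total']:.2f} s, "
          f"LLM share={s['llm_share']:.0%}, "
          f"{s['sec_per_token'] * 1000:.0f} ms/token")
```

The stage sums match the reported totals, and the LLM accounts for roughly 80 to 92 percent of end-to-end time on every runtime with a breakdown, which suggests the language model is the main target for further optimization.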
This project is partially based on the official MobileVLM repository: https://github.com/Meituan-AutoML/MobileVLM
We adapted and modified the original implementation for:
- PyTorch runtime
- ONNX deployment
- Core ML deployment
- On-device inference optimization
Copyright (c) ML2. All rights reserved.
This project includes code derived from the MobileVLM repository, which is licensed under the Apache License 2.0.
The official PyTorch-based MobileVLM was modified and exported to ONNX and Core ML for on-device inference.

