PaddleOCR-VL is a SOTA, resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model supports 109 languages and excels at recognizing complex elements (e.g., text, tables, formulas, and charts) while keeping resource consumption low. In comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference. These strengths make it well suited to practical deployment in real-world scenarios.
While PaddleOCR-VL-0.9B excels in common scenarios, its accuracy can fall short in specific or complex business applications. For instance:
- Domain-Specific Applications
  - Finance & Accounting: Recognizing documents such as invoices, receipts, bank statements, and financial reports.
  - Healthcare: Processing medical records, lab reports, handwritten prescriptions, and pharmaceutical instructions.
  - Legal Sector: Identifying text in contracts, legal instruments, court filings, and certificates.
- Non-Standard Text and Typography
  - Handwriting Recognition: Deciphering handwritten forms, notes, letters, and questionnaires.
  - Stylized & Artistic Fonts: Recognizing text on posters, billboards, product packaging, and menus.
  - Historical & Archival Documents: Processing ancient manuscripts, old newspapers, and historical archives.
- Task-Specific Structured Output
  - Table Recognition & Structuring: Converting tables within images into structured formats like Excel, CSV, or JSON.
  - Mathematical Formula Recognition: Identifying mathematical equations in textbooks or research papers and exporting them into formats like LaTeX.
This is where SFT (Supervised Fine-Tuning) becomes necessary to enhance the model’s accuracy and robustness for these specialized tasks.
Please ensure that you install ERNIE and its related dependencies in an environment with CUDA 12 or a later version. To avoid potential environment issues, we recommend building a container based on the official PaddlePaddle image.
The image already includes the PaddlePaddle framework, so no additional installation is required.
docker run --gpus all --name erniekit-ft-paddleocr-vl -v $PWD:/paddle --shm-size=128g --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.2.0-gpu-cuda12.6-cudnn9.5 /bin/bash
Clone ERNIEKit and install dependencies:
git clone https://github.com/PaddlePaddle/ERNIE
cd ERNIE
python -m pip install -r requirements/gpu/requirements.txt
python -m pip install -e .
python -m pip install tensorboard
python -m pip install opencv-python-headless
python -m pip install numpy==1.26.4
For more installation methods, please refer to the ERNIEKit Installation Guide.
The PaddleOCR-VL-0.9B model can be downloaded from Hugging Face or ModelScope.
huggingface-cli download PaddlePaddle/PaddleOCR-VL --local-dir PaddlePaddle/PaddleOCR-VL
For the training dataset format, please refer to SFT VL Dataset Format. Required fields are as follows:
- text_info: The list of text data; each element contains a text field and a tag field.
  - text: The text content, from either the user question or the system response.
  - tag: The mask tag (no_mask = include in training, mask = exclude).
- image_info: The list of image data; each element contains an image_url field and a matched_text_index field.
  - image_url: The URL to download the image online, or the path to access the image locally.
  - matched_text_index: The index of the matched text in text_info.
    - Default: matched_text_index=0 means the image is matched with the first text and will be placed before the first text.
Notes:
- Each training sample is in JSON format, with multiple samples separated by newlines
- Please ensure that mask items and no_mask items alternate in text_info.
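The format rules above can be checked before training. The following is a minimal sketch in Python and is not part of ERNIEKit: validate_sample and to_jsonl_line are hypothetical helper names, and treating a missing matched_text_index as 0 is an assumption based on the default described above.

```python
import json

VALID_TAGS = {"mask", "no_mask"}


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one SFT sample (empty if none)."""
    problems = []
    texts = sample.get("text_info", [])
    images = sample.get("image_info", [])

    if not texts:
        problems.append("text_info is missing or empty")

    # Every text item needs a text and a tag of mask / no_mask.
    for i, item in enumerate(texts):
        if "text" not in item or item.get("tag") not in VALID_TAGS:
            problems.append(f"text_info[{i}] needs a text and a valid tag")

    # mask and no_mask items must alternate.
    tags = [item.get("tag") for item in texts]
    for i in range(1, len(tags)):
        if tags[i] == tags[i - 1]:
            problems.append(f"text_info[{i - 1}] and text_info[{i}] share a tag")

    # Every image needs a location and must point at an existing text item.
    for i, image in enumerate(images):
        if "image_url" not in image:
            problems.append(f"image_info[{i}] has no image_url")
        index = image.get("matched_text_index", 0)  # assumed default: 0
        if not isinstance(index, int) or not 0 <= index < len(texts):
            problems.append(f"image_info[{i}] has an out-of-range index")

    return problems


def to_jsonl_line(sample: dict) -> str:
    """Serialize one sample as a single line, keeping non-ASCII text readable."""
    return json.dumps(sample, ensure_ascii=False)
```

Writing one to_jsonl_line result per line yields the newline-separated layout described in the notes.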
For your convenience, we also provide a quick-start Bengali training dataset for fine-tuning PaddleOCR-VL-0.9B on Bengali recognition. Download it using the following command:
wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl
Bengali training example:
{
  "image_info": [
    {"matched_text_index": 0, "image_url": "./assets/table_example.jps"}
  ],
  "text_info": [
    {"text": "OCR:", "tag": "mask"},
    {"text": "দডর মথ বধ বকসট একনজর দখই চনত পরল তর অনমন\nঠক পনতই লকয রখছ\nর নচ থকই চচয বলল কশর, “এইই; পযছ! পযছ!'\nওপর", "tag": "no_mask"}
  ]
}
Tables, formulas, and charts use a special data format. For details, please refer to 8.1. Table/Formula/Chart Data Format.
We provide a configuration file for the Bengali sample dataset. The key training hyperparameters are as follows:
- max_steps=926: Total number of training steps, approximately (D × E) / (G × B × A).
  - D: Number of training samples in the dataset.
  - E: Number of training epochs.
  - G: Number of GPUs used for training.
  - B: Batch size per GPU per step.
  - A: Number of gradient accumulation steps.
- warmup_steps=10: Number of linear warmup steps. It is recommended to set this to 1% of max_steps (0.01 × max_steps).
- packing_size=8: Number of samples packed into a sequence. Its effect is functionally equivalent to batch_size.
- max_seq_len=16384: The maximum sequence length. It is recommended to set this to the largest value that your GPU memory can accommodate during training.
- gradient_accumulation_steps=8: Number of gradient accumulation steps.
  - Model parameters are updated once every gradient_accumulation_steps steps.
  - When GPU memory is insufficient, you can decrease packing_size and increase gradient_accumulation_steps.
  - This is a time-for-space tradeoff: it reduces GPU memory usage but extends training time.
- learning_rate=5e-6: Learning rate, which determines the magnitude of each parameter update.
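The step formula above can be turned into a small calculator. This is a sketch, not an ERNIEKit utility: estimate_max_steps and recommended_warmup are hypothetical names, rounding up to a whole step is an assumption, and B is taken to be packing_size because the text describes the two as functionally equivalent.

```python
import math


def estimate_max_steps(num_samples: int, epochs: int, num_gpus: int,
                       batch_per_gpu: int, grad_accum: int) -> int:
    """Approximate max_steps as (D * E) / (G * B * A), rounded up."""
    samples_per_update = num_gpus * batch_per_gpu * grad_accum
    return math.ceil(num_samples * epochs / samples_per_update)


def recommended_warmup(max_steps: int, ratio: float = 0.01) -> int:
    """Roughly 1% of max_steps, with at least one warmup step."""
    return max(1, round(max_steps * ratio))


# Illustrative numbers only; they are not the size of the Bengali dataset.
steps = estimate_max_steps(num_samples=6400, epochs=1, num_gpus=1,
                           batch_per_gpu=8, grad_accum=8)
print(steps, recommended_warmup(steps))
```

The 1% rule gives only a starting point; the provided configuration rounds it to warmup_steps=10 for max_steps=926.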
Start the training using the following command:
CUDA_VISIBLE_DEVICES=0 \
erniekit train examples/configs/PaddleOCR-VL/sft/run_ocr_vl_sft_16k.yaml \
model_name_or_path=PaddlePaddle/PaddleOCR-VL \
train_dataset_path=./ocr_vl_sft-train_Bengali.jsonl
The training takes approximately 2 hours on a single A800-80G GPU.
By default, ERNIEKit uses all available GPUs on the machine. You can specify which GPUs ERNIEKit can use with the CUDA_VISIBLE_DEVICES environment variable.
The number of GPUs (G) affects the choice of training hyperparameters such as learning_rate, packing_size, and gradient_accumulation_steps. In theory, the number of samples used per update step, sample_num = G × B × A, has an approximately linear relationship with learning_rate. Therefore, when the number of GPUs increases by a factor of N (from G to N × G), there are two ways to adjust:
- Keep sample_num constant:
  - Decrease packing_size by a factor of x to packing_size/x.
  - Decrease gradient_accumulation_steps by a factor of y to gradient_accumulation_steps/y.
  - Where x * y = N.
- Increase learning_rate by a factor of N to N*learning_rate.
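The two adjustment methods can be written out as follows. This is an illustrative sketch: TrainConfig, scale_keep_sample_num, and scale_learning_rate are hypothetical names that only mirror the hyperparameters discussed above, and requiring exact divisibility is a simplifying assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    num_gpus: int
    packing_size: int
    gradient_accumulation_steps: int
    learning_rate: float

    @property
    def sample_num(self) -> int:
        # Samples consumed per parameter update: G * B * A.
        return (self.num_gpus * self.packing_size
                * self.gradient_accumulation_steps)


def scale_keep_sample_num(cfg: TrainConfig, n: int, x: int, y: int) -> TrainConfig:
    """Method 1: N times more GPUs, sample_num unchanged (requires x * y == n)."""
    if x * y != n:
        raise ValueError("x * y must equal n")
    if cfg.packing_size % x or cfg.gradient_accumulation_steps % y:
        raise ValueError("x and y must divide the current values exactly")
    return replace(
        cfg,
        num_gpus=cfg.num_gpus * n,
        packing_size=cfg.packing_size // x,
        gradient_accumulation_steps=cfg.gradient_accumulation_steps // y,
    )


def scale_learning_rate(cfg: TrainConfig, n: int) -> TrainConfig:
    """Method 2: N times more GPUs, learning_rate multiplied by N."""
    return replace(cfg, num_gpus=cfg.num_gpus * n,
                   learning_rate=cfg.learning_rate * n)


base = TrainConfig(num_gpus=1, packing_size=8,
                   gradient_accumulation_steps=8, learning_rate=5e-6)
print(scale_keep_sample_num(base, n=4, x=2, y=2))
print(scale_learning_rate(base, n=4))
```

With the first method the learning rate stays as configured; with the second, sample_num grows N-fold and the learning rate follows it linearly.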
You can visualize the training process using tensorboard. Launch it with the following command (the command below sets the port to 8084; please adjust it to an available port as needed):
tensorboard --logdir ./PaddleOCR-VL-SFT-Bengali/tensorboard_logs/ --port 8084
After the service starts successfully, you can view the training logs by entering ip:port in your browser (you can find the machine's IP address using the hostname -i command).
[Figure: training loss curve]
After training, the model will be saved in the path specified by output_dir=./PaddleOCR-VL-SFT-Bengali. The directory contains:
- preprocessor_config.json: Image preprocessing configuration file.
- config.json: Model configuration file.
- model-00001-of-00001.safetensors: Model weights file.
  - The format of the saved model can be controlled by save_to_hf, defaulting to the Hugging Face safetensors format.
- model.safetensors.index.json & static_name_to_dyg_name.json: Model weight index files, etc., used to assist in sharding and loading the model across multiple GPUs.
- tokenizer.model & tokenizer_config.json & special_tokens_map.json & added_tokens.json: Tokenizer files.
- train_args.bin: Training arguments file, which records the parameters used for training.
- train_state.json: Training state file, which records the training step and best metrics.
- train_results.json & all_results.json: Training results files, which record training progress, duration, time per step, time per sample, etc.
- generation.json: Generation configuration file.
- checkpoint-[save_steps*n]: Checkpoint folders. The training state is saved at multiples of save_steps. In addition to the files above, each folder also stores master-weight, optimizer-state, scheduler-state, etc., which can be used to resume training after an interruption.
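Before moving on to inference, it can help to confirm that the output directory looks as described. The sketch below is not an ERNIEKit tool: missing_files and list_checkpoints are hypothetical helpers, and EXPECTED_FILES simply repeats a subset of the file names listed above.

```python
from pathlib import Path

# File names taken from the list above; adjust if your run differs.
EXPECTED_FILES = [
    "preprocessor_config.json",
    "config.json",
    "model-00001-of-00001.safetensors",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
    "generation.json",
]


def missing_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


def list_checkpoints(output_dir):
    """Return checkpoint-<step> folders sorted by step number."""
    found = []
    for path in Path(output_dir).glob("checkpoint-*"):
        suffix = path.name[len("checkpoint-"):]
        if path.is_dir() and suffix.isdigit():
            found.append((int(suffix), path))
    return [path for _, path in sorted(found)]


if __name__ == "__main__":
    print("missing:", missing_files("./PaddleOCR-VL-SFT-Bengali"))
    print("checkpoints:", list_checkpoints("./PaddleOCR-VL-SFT-Bengali"))
```

Sorting by the numeric step rather than by name keeps checkpoint-100 after checkpoint-20.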
Install PaddleOCR for inference:
python -m pip install -U "paddleocr[doc-parser]"
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
python -m pip install --force-reinstall opencv-python-headless
python -m pip install numpy==1.26.4
Copy the necessary inference configuration files from the original PaddleOCR-VL model to the directory where the SFT-trained model is saved:
cp PaddlePaddle/PaddleOCR-VL/chat_template.jinja PaddleOCR-VL-SFT-Bengali
cp PaddlePaddle/PaddleOCR-VL/inference.yml PaddleOCR-VL-SFT-Bengali
We provide a Bengali test dataset that can be used for inference to observe the fine-tuning results. Download it using the following command:
wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl
[Figure: Bengali test image]
Use the following command for single-sample inference:
paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/dataset/bengali_sft/5b/7a/5b7a5c1c-207a-4924-b5f3-82890dc7b94a.png \
--vl_rec_model_name "PaddleOCR-VL-0.9B" \
--vl_rec_model_dir "./PaddleOCR-VL-SFT-Bengali" \
--save_path="./PaddleOCR-VL-SFT-Bengali_response"
# GT = নট চলল রফযনর পঠ সওযর\nহয গলয গলয ভব এখন দটত, মঝ মঝ খবর নয যদও লগ যয\nঝগড\nদরগর কছ চল এল
# Expected Answer = নট চলল রফযনর পঠ সওযর\nহয গলয গলয ভব এখন দটত, মঝ মঝ খবর নয যদও লগ যয\nঝগড\nদরগর কছ চল এল
The above command saves the results and visualization images in the PaddleOCR-VL-SFT-Bengali_response directory; the prediction results are stored in files with the .md extension. For more information on the inference capabilities of the paddleocr tool, please refer to: https://www.paddleocr.ai/latest/version3.x/pipeline_usage/PaddleOCR-VL.html.
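To compare a saved prediction with the ground truth, one option is a character error rate. The sketch below is an assumption-laden illustration rather than the project's evaluation method: read_predictions, edit_distance, and char_error_rate are hypothetical helpers, and the .md files are assumed to contain only the recognized text (any extra markup would need to be stripped first).

```python
from pathlib import Path


def read_predictions(response_dir):
    """Map each .md file name in response_dir to its stripped text content."""
    return {
        path.name: path.read_text(encoding="utf-8").strip()
        for path in sorted(Path(response_dir).glob("*.md"))
    }


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def char_error_rate(prediction: str, truth: str) -> float:
    """Edit distance normalized by the ground-truth length."""
    if not truth:
        return 0.0 if not prediction else 1.0
    return edit_distance(prediction, truth) / len(truth)


if __name__ == "__main__":
    truth = "..."  # paste the GT string for the test image here
    for name, text in read_predictions("./PaddleOCR-VL-SFT-Bengali_response").items():
        print(name, char_error_rate(text, truth))
```

A value of 0.0 means the prediction matches the ground truth exactly; running the same check against the base model gives a rough view of what fine-tuning changed.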
In particular, the following formats are used for specific data types:
Table Data: OTSL format
{
  "image_info": [
    {"matched_text_index": 0, "image_url": "./assets/table_example.jps"}
  ],
  "text_info": [
    {"text": "Table Recognition:", "tag": "mask"},
    {"text": "<fcel>分组<fcel>频数<fcel>频率<nl><fcel>[41,51)<fcel>2<fcel>\\( \\frac{2}{30} \\)<nl><fcel>[51,61)<fcel>1<fcel>\\( \\frac{1}{30} \\)<nl><fcel>[61,71)<fcel>4<fcel>\\( \\frac{4}{30} \\)<nl><fcel>[71,81)<fcel>6<fcel>\\( \\frac{6}{30} \\)<nl><fcel>[81,91)<fcel>10<fcel>\\( \\frac{10}{30} \\)<nl><fcel>[91,101)<fcel>5<fcel>\\( \\frac{5}{30} \\)<nl><fcel>[101,111)<fcel>2<fcel>\\( \\frac{2}{30} \\)<nl>", "tag": "no_mask"}
  ]
}
Formula Data: LaTeX format
{
  "image_info": [
    {"matched_text_index": 0, "image_url": "./assets/formula_example.jps"}
  ],
  "text_info": [
    {"text": "Formula Recognition:", "tag": "mask"},
    {"text": "\\[t_{n}\\in[0,\\infty]\\]", "tag": "no_mask"}
  ]
}
Chart Data: Markdown format
{
  "image_info": [
    {"matched_text_index": 0, "image_url": "./assets/chart_example.png"}
  ],
  "text_info": [
    {"text": "Chart Recognition:", "tag": "mask"},
    {"text": " | 22Q3 | 22Q3yoy\n电商 | 85 | 100%\n川渝 | 140 | 8%\n云贵陕 | 95 | 12%\n外围地区 | 45 | 20%", "tag": "no_mask"}
  ]
}
If you encounter the following error when running the commands above, it is usually caused by a conflict between cv2 and the environment. It can be resolved by installing opencv-python-headless.
Error message
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Solution
python -m pip install --force-reinstall opencv-python-headless
python -m pip install numpy==1.26.4




