56 commits
e6bed84
feat: add DeepSeek-OCR-2 model registration and docs
thisisiron Mar 18, 2026
7a1a400
feat: add DeepseekOcr2ImageProcessor
thisisiron Mar 18, 2026
f3ec285
feat: add DeepseekOcr2Processor and refactor image processor tile_size
thisisiron Mar 23, 2026
95b0d57
feat: add script to convert DeepSeek-OCR-2 weights to Hugging Face fo…
thisisiron Mar 23, 2026
b81bc0f
feat: enhance DeepSeek-OCR-2 processing and inference test
thisisiron Mar 23, 2026
bbf3f94
refactor
thisisiron Mar 23, 2026
2e05944
feat: add fast image processor
thisisiron Mar 27, 2026
8f3270b
refactor
thisisiron Mar 27, 2026
3ee14eb
test: add image processor tests for DeepseekOcr2
thisisiron Mar 27, 2026
58b683e
fix: make background_color configurable
thisisiron Mar 27, 2026
3710499
refactor: migrate image processors to pil/torchvision backend pattern
thisisiron Mar 27, 2026
1bfa054
test: add processor tests for DeepseekOcr2
thisisiron Mar 28, 2026
1878af2
fix: update __init__
thisisiron Mar 28, 2026
46ddaec
chore: clean up unused imports and fix formatting
thisisiron Mar 28, 2026
8754d8d
feat: add configuration, modeling, and modular for DeepseekOcr2
thisisiron Mar 28, 2026
fef7b5f
fix: style fixes, update docs, and minor cleanups
thisisiron Mar 28, 2026
5520c31
fix: use @strict
thisisiron Mar 28, 2026
fee7de0
fix: register private models
thisisiron Mar 28, 2026
8531218
docs: add usage example and expand DeepSeek-OCR-2 model doc
thisisiron Mar 28, 2026
f6fc20b
fix: add checkpoint to auto_docstring
thisisiron Mar 28, 2026
b4bfbf5
fix: remove comment
thisisiron Mar 28, 2026
e775577
fix: remove unused max_query
thisisiron Mar 28, 2026
44482df
fix: clean up DeepSeek-OCR2 modular
thisisiron Apr 3, 2026
4b1605a
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 3, 2026
c6f5eaf
docs: update date
thisisiron Apr 3, 2026
f05c252
refactor: inherit SamVisionEncoder
thisisiron Apr 3, 2026
74ee9f3
refactor: use create_causal_mask with or_mask_function
thisisiron Apr 6, 2026
25c5454
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 6, 2026
33a6159
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 7, 2026
2f931aa
docs: update date
thisisiron Apr 7, 2026
d194f99
fix: address PR review
thisisiron Apr 9, 2026
c08b036
refactor: use modular for image processor
thisisiron Apr 9, 2026
5ce5029
refactor: restructure DeepSeek-OCR-2 config, image processor, and pro…
thisisiron Apr 9, 2026
917d086
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 9, 2026
eec01bc
fix: sync hidden_size and rms_norm_eps from encoder_config to vision_…
thisisiron Apr 10, 2026
7ef2141
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 10, 2026
10639ec
refactor: remove comment
thisisiron Apr 10, 2026
1efd730
fix
thisisiron Apr 10, 2026
9c70392
fix: correct EncoderConfig docstring example
thisisiron Apr 10, 2026
b9c75c6
refactor: add PIL image processor to modular
thisisiron Apr 10, 2026
1433b32
refactor: address review comments on config, processor, and model
thisisiron Apr 10, 2026
22392ad
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 10, 2026
6b95f09
fix: adjust processing tests for image token expansion
thisisiron Apr 10, 2026
55fc4aa
fix: move view_separator to correct device for model parallelism
thisisiron Apr 11, 2026
68b5d42
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 11, 2026
f85aabb
fix: add DeepseekOcr2ImageProcessorPil to __all__
thisisiron Apr 11, 2026
c96106a
fix: remove SDPA skip
thisisiron Apr 11, 2026
8dbfda5
test: skip offload/export tests
thisisiron Apr 11, 2026
8b391e6
refactor: address review comments
thisisiron Apr 14, 2026
0102cbf
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 14, 2026
451fd53
fix: remove in DeepseekOcr2Model
thisisiron Apr 14, 2026
0f73ee0
refactor: enforce explicit tokens in DeepseekOcr2Processor
thisisiron Apr 15, 2026
50b210c
Merge branch 'main' into add-deepseek_ocr2
thisisiron Apr 15, 2026
f81b8b9
refactor: inherit DeepseekOcr2ImageProcessorKwargs from GotOcr2ImageP…
thisisiron Apr 15, 2026
d84deae
refactor: remove unused image processing
thisisiron Apr 15, 2026
54294ac
Update src/transformers/models/deepseek_ocr2/modular_deepseek_ocr2.py
thisisiron Apr 23, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -551,6 +551,8 @@
title: DeBERTa
- local: model_doc/deberta-v2
title: DeBERTa-v2
- local: model_doc/deepseek_ocr2
title: DeepSeek-OCR-2
- local: model_doc/deepseek_v2
title: DeepSeek-V2
- local: model_doc/deepseek_v3
101 changes: 101 additions & 0 deletions docs/source/en/model_doc/deepseek_ocr2.md
@@ -0,0 +1,101 @@
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2026-01-28 and added to Hugging Face Transformers on 2026-04-14.*

# DeepSeek-OCR-2


## Overview

The DeepSeek-OCR-2 model was proposed in [Visual Causal Flow: A Novel Approach to OCR-Specialized Vision-Language Models](https://huggingface.co/papers/2601.20552) by the DeepSeek team.

DeepSeek-OCR-2 is an OCR-specialized vision-language model built from three stages: a SAM ViT-B vision encoder feeds into a Qwen2 hybrid attention encoder, which is connected through an MLP projector to a DeepSeek-V2 Mixture-of-Experts (MoE) language model. The key feature of the model is its hybrid attention mechanism, which applies bidirectional attention over image tokens and causal attention over query tokens.

<img src="https://huggingface.co/deepseek-ai/DeepSeek-OCR-2/resolve/main/assets/fig1.png" width="800">

<small> DeepSeek-OCR 2: Visual Causal Flow.</small>
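
The hybrid attention pattern described above can be illustrated with a small boolean mask. This is a minimal sketch rather than the model's implementation: it assumes image tokens come first in the sequence and query tokens follow, and the helper name `hybrid_attention_mask` is hypothetical.

```python
def hybrid_attention_mask(num_image_tokens: int, num_query_tokens: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption for illustration only: the sequence is laid out as
    [image tokens..., query tokens...].
    """
    total = num_image_tokens + num_query_tokens
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i  # standard causal rule: attend to self and earlier positions
            both_image = i < num_image_tokens and j < num_image_tokens  # bidirectional image block
            row.append(causal or both_image)
        mask.append(row)
    return mask


mask = hybrid_attention_mask(num_image_tokens=3, num_query_tokens=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

Under these assumptions, image tokens see every other image token, while each query token sees all image tokens plus the query tokens up to and including itself.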

This model was contributed by [thisisiron](https://huggingface.co/thisisiron).


## Usage example

### Plain OCR

```python
>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> model = AutoModelForImageTextToText.from_pretrained(
...     "thisisiron/DeepSeek-OCR-2-hf", dtype=torch.bfloat16, device_map="auto"
... )
>>> processor = AutoProcessor.from_pretrained("thisisiron/DeepSeek-OCR-2-hf")

>>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/image_ocr.jpg"
>>> inputs = processor(images=image, text="<image>\nFree OCR.", return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096)
>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
"R&D QUALITY IMPROVEMENT\nSUGGESTION/SOLUTION FORM\nName/Phone Ext. : (...)"
```

### Grounding with markdown conversion

Adding the `<|grounding|>` token to the prompt enables coordinate-aware output: each detected region is labeled inside `<|ref|>` tags and its bounding box is given inside `<|det|>` tags.

```python
>>> inputs = processor(
...     images=image,
...     text="<image>\n<|grounding|>Convert the document to markdown.",
...     return_tensors="pt",
... ).to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096)
>>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=False)
"<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY (...)"
```
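
The grounded output can be post-processed into structured records. The snippet below is a minimal sketch, assuming the tag layout seen in the single example output above (`<|ref|>label<|/ref|>` immediately followed by `<|det|>boxes<|/det|>`, with the boxes written as a JSON-compatible nested list). The helper name `parse_grounding` is hypothetical and not part of the library, and the coordinate system of the boxes is not stated here.

```python
import json
import re

# One grounded span: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
# (layout inferred from the example output; treat it as an assumption)
GROUNDING_PATTERN = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL)


def parse_grounding(decoded: str) -> list[tuple[str, list[list[int]]]]:
    """Extract (label, boxes) pairs from decoded grounding output."""
    results = []
    for label, boxes in GROUNDING_PATTERN.findall(decoded):
        # Assumes the box list is valid JSON, as in the example output.
        results.append((label, json.loads(boxes)))
    return results


sample = "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY (...)"
print(parse_grounding(sample))
# [('title', [[330, 198, 558, 230]])]
```

Decoding with `skip_special_tokens=False`, as in the example above, keeps the tags in the text so that a parser like this can find them.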

## DeepseekOcr2Config

[[autodoc]] DeepseekOcr2Config

## DeepseekOcr2ImageProcessor

[[autodoc]] DeepseekOcr2ImageProcessor

## DeepseekOcr2ImageProcessorPil

[[autodoc]] DeepseekOcr2ImageProcessorPil

## DeepseekOcr2Processor

[[autodoc]] DeepseekOcr2Processor

## DeepseekOcr2TextModel

[[autodoc]] DeepseekOcr2TextModel

## DeepseekOcr2VisionModel

[[autodoc]] DeepseekOcr2VisionModel

## DeepseekOcr2Model

[[autodoc]] DeepseekOcr2Model

## DeepseekOcr2ForConditionalGeneration

[[autodoc]] DeepseekOcr2ForConditionalGeneration
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -91,6 +91,7 @@
from .deberta import *
from .deberta_v2 import *
from .decision_transformer import *
from .deepseek_ocr2 import *
from .deepseek_v2 import *
from .deepseek_v3 import *
from .deepseek_vl import *
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -109,6 +109,7 @@
("deberta", "DebertaConfig"),
("deberta-v2", "DebertaV2Config"),
("decision_transformer", "DecisionTransformerConfig"),
("deepseek_ocr2", "DeepseekOcr2Config"),
("deepseek_v2", "DeepseekV2Config"),
("deepseek_v3", "DeepseekV3Config"),
("deepseek_vl", "DeepseekVLConfig"),
@@ -623,6 +624,7 @@
("deberta", "DeBERTa"),
("deberta-v2", "DeBERTa-v2"),
("decision_transformer", "Decision Transformer"),
("deepseek_ocr2", "DeepSeek-OCR-2"),
("deepseek_v2", "DeepSeek-V2"),
("deepseek_v3", "DeepSeek-V3"),
("deepseek_vl", "DeepseekVL"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -88,6 +88,7 @@
("convnextv2", {"torchvision": "ConvNextImageProcessor", "pil": "ConvNextImageProcessorPil"}),
("cvt", {"torchvision": "ConvNextImageProcessor", "pil": "ConvNextImageProcessorPil"}),
("data2vec-vision", {"torchvision": "BeitImageProcessor", "pil": "BeitImageProcessorPil"}),
("deepseek_ocr2", {"torchvision": "DeepseekOcr2ImageProcessor", "pil": "DeepseekOcr2ImageProcessorPil"}),
("deepseek_vl", {"torchvision": "DeepseekVLImageProcessor", "pil": "DeepseekVLImageProcessorPil"}),
(
"deepseek_vl_hybrid",
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -112,6 +112,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("deberta", "DebertaModel"),
("deberta-v2", "DebertaV2Model"),
("decision_transformer", "DecisionTransformerModel"),
("deepseek_ocr2", "DeepseekOcr2Model"),
("deepseek_v2", "DeepseekV2Model"),
("deepseek_v3", "DeepseekV3Model"),
("deepseek_vl", "DeepseekVLModel"),
@@ -969,6 +970,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("cohere2_vision", "Cohere2VisionForConditionalGeneration"),
("deepseek_ocr2", "DeepseekOcr2ForConditionalGeneration"),
("deepseek_vl", "DeepseekVLForConditionalGeneration"),
("deepseek_vl_hybrid", "DeepseekVLHybridForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -69,6 +69,7 @@
("colmodernvbert", "ColModernVBertProcessor"),
("colpali", "ColPaliProcessor"),
("colqwen2", "ColQwen2Processor"),
("deepseek_ocr2", "DeepseekOcr2Processor"),
("deepseek_vl", "DeepseekVLProcessor"),
("deepseek_vl_hybrid", "DeepseekVLHybridProcessor"),
("dia", "DiaProcessor"),
30 changes: 30 additions & 0 deletions src/transformers/models/deepseek_ocr2/__init__.py
@@ -0,0 +1,30 @@
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_deepseek_ocr2 import *
    from .image_processing_deepseek_ocr2 import *
    from .image_processing_pil_deepseek_ocr2 import *
    from .modeling_deepseek_ocr2 import *
    from .processing_deepseek_ocr2 import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)