Add Deepseek-OCR-2 model #45075
Open
thisisiron wants to merge 56 commits into huggingface:main from thisisiron:add-deepseek_ocr2
Commits
e6bed84
feat: add DeepSeek-OCR-2 model registration and docs
thisisiron 7a1a400
feat: add DeepseekOcr2ImageProcessor
thisisiron f3ec285
feat: add DeepseekOcr2Processor and refactor image processor tile_size
thisisiron 95b0d57
feat: add script to convert DeepSeek-OCR-2 weights to Hugging Face fo…
thisisiron b81bc0f
feat: enhance DeepSeek-OCR-2 processing and inference test
thisisiron bbf3f94
refactor
thisisiron 2e05944
feat: add fast image processor
thisisiron 8f3270b
refactor
thisisiron 3ee14eb
test: add image processor tests for DeepseekOcr2
thisisiron 58b683e
fix: make background_color configurable
thisisiron 3710499
refactor: migrate image processors to pil/torchvision backend pattern
thisisiron 1bfa054
test: add processor tests for DeepseekOcr2
thisisiron 1878af2
fix: update __init__
thisisiron 46ddaec
chore: clean up unused imports and fix formatting
thisisiron 8754d8d
feat: add configuration, modeling, and modular for DeepseekOcr2
thisisiron fef7b5f
fix: style fixes, update docs, and minor cleanups
thisisiron 5520c31
fix: use @strict
thisisiron fee7de0
fix: register private models
thisisiron 8531218
docs: add usage example and expand DeepSeek-OCR-2 model doc
thisisiron f6fc20b
fix: add checkpoint to auto_docstring
thisisiron b4bfbf5
fix: remove comment
thisisiron e775577
fix: remove unused max_query
thisisiron 44482df
fix: clean up DeepSeek-OCR2 modular
thisisiron 4b1605a
Merge branch 'main' into add-deepseek_ocr2
thisisiron c6f5eaf
docs: update date
thisisiron f05c252
refactor: inherit SamVisionEncoder
thisisiron 74ee9f3
refactor: use create_causal_mask with or_mask_function
thisisiron 25c5454
Merge branch 'main' into add-deepseek_ocr2
thisisiron 33a6159
Merge branch 'main' into add-deepseek_ocr2
thisisiron 2f931aa
docs: update date
thisisiron d194f99
fix: address PR review
thisisiron c08b036
refactor: use modular for image processor
thisisiron 5ce5029
refactor: restructure DeepSeek-OCR-2 config, image processor, and pro…
thisisiron 917d086
Merge branch 'main' into add-deepseek_ocr2
thisisiron eec01bc
fix: sync hidden_size and rms_norm_eps from encoder_config to vision_…
thisisiron 7ef2141
Merge branch 'main' into add-deepseek_ocr2
thisisiron 10639ec
refactor: remove comment
thisisiron 1efd730
fix
thisisiron 9c70392
fix: correct EncoderConfig docstring example
thisisiron b9c75c6
refactor: add PIL image processor to modular
thisisiron 1433b32
refactor: address review comments on config, processor, and model
thisisiron 22392ad
Merge branch 'main' into add-deepseek_ocr2
thisisiron 6b95f09
fix: adjust processing tests for image token expansion
thisisiron 55fc4aa
fix: move view_separator to correct device for model parallelism
thisisiron 68b5d42
Merge branch 'main' into add-deepseek_ocr2
thisisiron f85aabb
fix: add DeepseekOcr2ImageProcessorPil to __all__
thisisiron c96106a
fix: remove SDPA skip
thisisiron 8dbfda5
test: skip offload/export tests
thisisiron 8b391e6
refactor: address review comments
thisisiron 0102cbf
Merge branch 'main' into add-deepseek_ocr2
thisisiron 451fd53
fix: remove in DeepseekOcr2Model
thisisiron 0f73ee0
refactor: enforce explicit tokens in DeepseekOcr2Processor
thisisiron 50b210c
Merge branch 'main' into add-deepseek_ocr2
thisisiron f81b8b9
refactor: inherit DeepseekOcr2ImageProcessorKwargs from GotOcr2ImageP…
thisisiron d84deae
refactor: remove unused image processing
thisisiron 54294ac
Update src/transformers/models/deepseek_ocr2/modular_deepseek_ocr2.py
thisisiron
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| <!--Copyright 2026 The HuggingFace Team. All rights reserved. | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations under the License. | ||
|
|
||
| ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be | ||
| rendered properly in your Markdown viewer. | ||
|
|
||
| --> | ||
| *This model was released on 2026-01-28 and added to Hugging Face Transformers on 2026-04-14.* | ||
|
|
||
| # DeepSeek-OCR-2 | ||
|
|
||
|
|
||
| ## Overview | ||
|
|
||
| The DeepSeek-OCR-2 model was proposed in [Visual Causal Flow: A Novel Approach to OCR-Specialized Vision-Language Models](https://huggingface.co/papers/2601.20552) by the DeepSeek team. | ||
|
|
||
| DeepSeek-OCR-2 is an OCR-specialized vision-language model. A SAM ViT-B vision encoder feeds into a Qwen2 hybrid attention encoder, which connects through an MLP projector to a DeepSeek-V2 Mixture-of-Experts (MoE) language model. The hybrid attention mechanism applies bidirectional attention over image tokens and causal attention over query tokens. | ||
|
|
||
| <img src="https://huggingface.co/deepseek-ai/DeepSeek-OCR-2/resolve/main/assets/fig1.png" width="800"> | ||
|
|
||
| <small> DeepSeek-OCR 2: Visual Causal Flow.</small> | ||
|
|
||
| This model was contributed by [thisisiron](https://huggingface.co/thisisiron). | ||
|
|
||
|
|
||
| ## Usage example | ||
|
|
||
| ### Plain OCR | ||
|
|
||
| ```python | ||
| >>> import torch | ||
| >>> from transformers import AutoProcessor, AutoModelForImageTextToText | ||
|
|
||
| >>> model = AutoModelForImageTextToText.from_pretrained( | ||
| ... "thisisiron/DeepSeek-OCR-2-hf", dtype=torch.bfloat16, device_map="auto" | ||
| ... ) | ||
| >>> processor = AutoProcessor.from_pretrained("thisisiron/DeepSeek-OCR-2-hf") | ||
|
|
||
| >>> image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/image_ocr.jpg" | ||
| >>> inputs = processor(images=image, text="<image>\nFree OCR.", return_tensors="pt").to(model.device, dtype=torch.bfloat16) | ||
|
|
||
| >>> generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096) | ||
| >>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True) | ||
| "R&D QUALITY IMPROVEMENT\nSUGGESTION/SOLUTION FORM\nName/Phone Ext. : (...)" | ||
| ``` | ||
|
|
||
| ### Grounding with markdown conversion | ||
|
|
||
| Adding the `<|grounding|>` token to the prompt enables coordinate-aware output: each element's label is wrapped in `<|ref|>` tags and its bounding boxes in `<|det|>` tags. | ||
|
|
||
| ```python | ||
| >>> inputs = processor( | ||
| ... images=image, | ||
| ... text="<image>\n<|grounding|>Convert the document to markdown.", | ||
| ... return_tensors="pt", | ||
| ... ).to(model.device, dtype=torch.bfloat16) | ||
|
|
||
| >>> generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096) | ||
| >>> processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=False) | ||
| "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY (...)" | ||
| ``` | ||
|
|
||
| ## DeepseekOcr2Config | ||
|
|
||
| [[autodoc]] DeepseekOcr2Config | ||
|
|
||
| ## DeepseekOcr2ImageProcessor | ||
|
|
||
| [[autodoc]] DeepseekOcr2ImageProcessor | ||
|
|
||
| ## DeepseekOcr2ImageProcessorPil | ||
|
|
||
| [[autodoc]] DeepseekOcr2ImageProcessorPil | ||
|
|
||
| ## DeepseekOcr2Processor | ||
|
|
||
| [[autodoc]] DeepseekOcr2Processor | ||
|
|
||
| ## DeepseekOcr2TextModel | ||
|
|
||
| [[autodoc]] DeepseekOcr2TextModel | ||
|
|
||
| ## DeepseekOcr2VisionModel | ||
|
|
||
| [[autodoc]] DeepseekOcr2VisionModel | ||
|
|
||
| ## DeepseekOcr2Model | ||
|
|
||
| [[autodoc]] DeepseekOcr2Model | ||
|
|
||
| ## DeepseekOcr2ForConditionalGeneration | ||
|
|
||
| [[autodoc]] DeepseekOcr2ForConditionalGeneration | ||
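
The grounding output shown in the usage example of this doc can be post-processed into structured records. The following is a minimal sketch and not part of the PR: the helper name `parse_grounding` is hypothetical, and it assumes that every `<|ref|>` label is directly followed by a `<|det|>` block holding a JSON-style list of four-number boxes, as in the sample output. The scale and ordering of the coordinates are not stated in the doc, so the sketch returns them unchanged.

```python
import json
import re

# Assumption: a label is always directly followed by its box list,
# matching the sample output in the model doc.
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>(?P<boxes>.*?)<\|/det\|>",
    re.DOTALL,
)


def parse_grounding(text):
    """Return a list of (label, boxes) pairs found in decoded grounding output."""
    results = []
    for match in GROUNDING_PATTERN.finditer(text):
        # The box list looks like JSON ("[[a, b, c, d]]"), so json.loads can read it.
        boxes = json.loads(match.group("boxes"))
        results.append((match.group("label"), boxes))
    return results


# Sample string copied from the doc's expected output.
sample = "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY (...)"
parsed = parse_grounding(sample)
print(parsed)  # [('title', [[330, 198, 558, 230]])]
```

Text without any grounding tags yields an empty list, so the helper can be applied to plain OCR output as well.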
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| # Copyright 2026 The HuggingFace Team. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| from typing import TYPE_CHECKING | ||
|
|
||
| from ...utils import _LazyModule | ||
| from ...utils.import_utils import define_import_structure | ||
|
|
||
|
|
||
| if TYPE_CHECKING: | ||
| from .configuration_deepseek_ocr2 import * | ||
| from .image_processing_deepseek_ocr2 import * | ||
| from .image_processing_pil_deepseek_ocr2 import * | ||
| from .modeling_deepseek_ocr2 import * | ||
| from .processing_deepseek_ocr2 import * | ||
| else: | ||
| import sys | ||
|
|
||
| _file = globals()["__file__"] | ||
| sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__) |
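
The `__init__.py` above replaces the package module with a lazy module, so submodules are imported only when one of their names is first accessed. The sketch below illustrates that general idea using only the standard library. It is not the implementation of `_LazyModule` or `define_import_structure`; the class name `LazyModule`, its `import_structure` argument, and the `loaded` list are hypothetical names chosen for illustration.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical stand-in: imports the defining module on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported attribute name to the module that defines it.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # modules imported so far, recorded for illustration only

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup has failed.
        if attr.startswith("_") or attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module_name = self._attr_to_module[attr]
        module = importlib.import_module(module_name)
        self.loaded.append(module_name)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache the value so later lookups skip __getattr__
        return value


# Standard-library modules stand in for the model's submodules.
lazy = LazyModule("demo_lazy", {"math": ["sqrt"], "json": ["dumps"]})
root = lazy.sqrt(9.0)           # first access resolves "sqrt" through "math"
loaded_after_first = list(lazy.loaded)
print(root, loaded_after_first)  # 3.0 ['math']
```

Because the resolved attribute is cached on the module object, a second `lazy.sqrt` lookup does not go through `__getattr__` again.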