hassanaiengineer/Multi-Engine-OCR

Professional OCR Repository

This repository contains a clean, production-ready set of OCR (Optical Character Recognition) tools designed for high-quality text extraction from PDFs and images, with a focus on resumes and structured documents.

🚀 Features

  • Multiple OCR Engines: Choose between Docling, PaddleOCR, and a hybrid OpenCV + Docling pipeline.
  • Hybrid Extraction: Combines advanced image preprocessing (OpenCV) with Docling for superior accuracy on noisy documents.
  • Automated Cleaning: Built-in post-processing to remove OCR artifacts, fix common misreads, and normalize spacing.
  • Text Optimization: Intelligent logic to detect and reposition text blocks for accurate extraction.
  • GPU Support: Uses the GPU automatically when one is available; pass --no-gpu to force CPU processing.
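
The automated cleaning step could look roughly like the sketch below. It is not the code in utils/text_processing.py: the function name `clean_ocr_text`, the replacement table, and the exact regular expressions are assumptions chosen to illustrate artifact removal, misread fixes, and spacing normalization.

```python
import re

# Hypothetical table of common OCR misreads (ligatures and curly quotes).
REPLACEMENTS = {
    "\ufb01": "fi",   # "fi" ligature
    "\ufb02": "fl",   # "fl" ligature
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
}


def clean_ocr_text(text: str) -> str:
    """Illustrative post-processing pass over raw OCR output."""
    # Fix common misreads.
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Remove control characters (form feeds etc.), keeping tabs and newlines.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop trailing spaces at the end of each line.
    text = re.sub(r" +\n", "\n", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    raw = "Pro\ufb01le   summary\n\n\n\nExperi-\nence  \x0c"
    print(clean_ocr_text(raw))
```

The hyphen rule also joins genuinely hyphenated compounds that happen to break at a line end, so a real cleaner would likely need a dictionary check or a more conservative rule.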

🛠️ Available OCR Engines

1. Docling OCR (docling)

  • Base: IBM Docling
  • Best for: Standard digital and high-quality scanned PDFs.
  • Output: Clean Markdown with preserved table structures.

2. PaddleOCR (paddle)

  • Base: PaddleOCR
  • Best for: Multilingual documents and documents with complex layouts or handwriting.
  • Process: Includes custom image preprocessing, de-skewing, and line merging.

3. Hybrid Engine (hybrid)

  • Process: PyMuPDF (fitz) -> OpenCV Preprocessing -> Docling OCR
  • Best for: Low-quality scans, noisy backgrounds, or documents that fail with standard methods.
  • Technique: Uses CLAHE (Contrast Limited Adaptive Histogram Equalization), adaptive thresholding, and denoising before extraction.

📦 Installation

  1. Clone the repository:

    git clone <repo-url>
    cd Multi-Engine-OCR
  2. Install dependencies:

    pip install -r requirements.txt

    Note: Depending on your operating system, you may also need to install Tesseract and/or PaddlePaddle separately.


🏃 Usage

You can run any OCR engine using the run.py script:

    # Basic usage (defaults to Docling)
    python run.py path/to/your/document.pdf

    # Using PaddleOCR
    python run.py path/to/your/document.pdf --engine paddle

    # Using the Hybrid Engine
    python run.py path/to/your/document.pdf --engine hybrid

    # Specify output directory
    python run.py document.pdf --output my_results

    # Force CPU usage
    python run.py document.pdf --no-gpu
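
For orientation, the flags shown above could be parsed with argparse along these lines. This is a hypothetical sketch rather than the contents of run.py: the function name `build_parser`, the help strings, and the default output directory `output` are assumptions; only the flag names and the Docling default come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the command-line flags used in the examples."""
    parser = argparse.ArgumentParser(
        description="Run an OCR engine on a PDF or image."
    )
    parser.add_argument("input", help="Path to the PDF or image to process")
    parser.add_argument(
        "--engine",
        choices=["docling", "paddle", "hybrid"],
        default="docling",
        help="OCR engine to use (default: docling)",
    )
    parser.add_argument(
        "--output",
        default="output",  # assumed default; the real script may differ
        help="Directory where results are written",
    )
    parser.add_argument(
        "--no-gpu",
        action="store_true",
        help="Force CPU processing even if a GPU is available",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"engine={args.engine} input={args.input} "
          f"output={args.output} use_gpu={not args.no_gpu}")
```

Restricting `--engine` with `choices` means a mistyped engine name fails immediately with a usage message rather than partway through processing.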

📁 Repository Structure

.
├── engines/
│   ├── docling_engine.py   # Docling implementation
│   ├── paddle_engine.py    # PaddleOCR implementation
│   └── hybrid_engine.py    # Image + Docling hybrid implementation
├── utils/
│   └── text_processing.py  # Cleaning & post-processing logic
├── run.py                  # Main entry point CLI
├── README.md               # Documentation
└── requirements.txt        # Project dependencies

Meet the Developer

📧 Email: hassanaiengineer@gmail.com
🔗 LinkedIn: Hassan Khan
🔗 Upwork: Hassan Khan on Upwork


Made with ❤️ by Hassan Khan
