This repository contains a clean, production-ready set of OCR (Optical Character Recognition) tools designed for high-quality text extraction from PDFs and images, with a focus on resumes and structured documents.
- Multiple OCR Engines: Choose between different state-of-the-art OCR technologies.
- Hybrid Extraction: Combines advanced image preprocessing (OpenCV) with Docling for superior accuracy on noisy documents.
- Automated Cleaning: Built-in post-processing to remove OCR artifacts, fix common misreads, and normalize spacing.
- Text Optimization: Intelligent logic to detect and reposition text blocks for accurate extraction.
- GPU Support: Automatic GPU acceleration for faster processing when available.
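
As an illustration of the automated cleaning step, here is a minimal sketch using only the standard library. It is not the code in `utils/text_processing.py`: the function name `clean_ocr_text` and the substitution rules are hypothetical, chosen to show the three ideas named above (artifact removal, misread fixes, spacing normalization).

```python
import re

# Hypothetical misread rules, for illustration only. Real rules depend on the
# documents being processed and should be tuned against actual OCR output.
_MISREADS = [
    (re.compile(r"(?<=\d)[oO](?=\d)"), "0"),  # letter O between digits -> zero
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),  # l or I between digits -> one
]


def clean_ocr_text(text: str) -> str:
    """Remove common OCR artifacts and normalize spacing."""
    # Drop control characters, keeping newlines and tabs.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Rejoin words hyphenated across a line break (heuristic: this also
    # joins genuine compounds such as "well-\nknown").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Apply the misread substitutions.
    for pattern, replacement in _MISREADS:
        text = pattern.sub(replacement, text)
    # Collapse runs of spaces and tabs, then trim spaces around newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Rules like the digit substitutions are deliberately narrow (they only fire between two digits) so that ordinary words are left untouched.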
- Base: IBM Docling
- Best for: Standard digital and high-quality scanned PDFs.
- Output: Clean Markdown with preserved table structures.
- Base: PaddleOCR
- Best for: Multilingual support and documents with complex layouts or handwriting.
- Process: Includes custom image preprocessing, de-skewing, and line merging.
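
Line merging can be sketched as grouping recognized words by vertical position and then ordering each group left to right. The example below is an assumption about how such a step might work, not the logic in `paddle_engine.py`. The `Word` tuple is a simplified, hypothetical format; PaddleOCR itself returns four-point boxes that would first need to be reduced to these values.

```python
from typing import List, Tuple

# Hypothetical simplified word record: (x_left, y_center, height, text).
Word = Tuple[float, float, float, str]


def merge_into_lines(words: List[Word], tolerance: float = 0.5) -> List[str]:
    """Group words whose vertical centers are close, then join left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):  # top to bottom
        _, y, height, _ = word
        if lines:
            current = lines[-1]
            mean_y = sum(w[1] for w in current) / len(current)
            max_height = max(w[2] for w in current)
            # Same line if the center is within a fraction of the text height.
            if abs(y - mean_y) <= tolerance * max(height, max_height):
                current.append(word)
                continue
        lines.append([word])
    # Order each line by x position and join the text with single spaces.
    return [" ".join(w[3] for w in sorted(line, key=lambda w: w[0])) for line in lines]
```

For example, `merge_into_lines([(120, 11, 20, "Experience"), (10, 10, 20, "Work"), (10, 50, 20, "2019")])` groups the first two words into one line because their centers differ by far less than half the text height.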
- Process: PyMuPDF (fitz) -> OpenCV preprocessing -> Docling OCR
- Best for: Low-quality scans, noisy backgrounds, or documents that fail with standard methods.
- Technique: Uses CLAHE (Contrast Limited Adaptive Histogram Equalization), adaptive thresholding, and denoising before extraction.
- Clone the repository:
git clone <repo-url>
cd Multi-Engine-OCR
- Install dependencies:
pip install -r requirements.txt
Note: Depending on your operating system and the engine you plan to use, you may also need to install Tesseract and/or PaddlePaddle separately.
You can run any OCR engine using the run.py script:
# Basic usage (defaults to Docling)
python run.py path/to/your/document.pdf
# Using PaddleOCR
python run.py path/to/your/document.pdf --engine paddle
# Using the Hybrid Engine
python run.py path/to/your/document.pdf --engine hybrid
# Specify output directory
python run.py document.pdf --output my_results
# Force CPU usage
python run.py document.pdf --no-gpu
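
For readers curious how a command-line interface with these flags might be wired up, here is a minimal `argparse` sketch. It mirrors the flags shown in the commands above but is not the actual contents of `run.py`; the default output directory name `output` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags."""
    parser = argparse.ArgumentParser(
        prog="run.py", description="Run an OCR engine on a PDF or image."
    )
    parser.add_argument("input", help="Path to the document to process")
    parser.add_argument(
        "--engine",
        choices=["docling", "paddle", "hybrid"],
        default="docling",  # documented default
        help="OCR engine to use",
    )
    parser.add_argument(
        "--output",
        default="output",  # assumed default; the real one may differ
        help="Directory for extracted text",
    )
    parser.add_argument(
        "--no-gpu", action="store_true", help="Force CPU processing"
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["document.pdf", "--engine", "hybrid", "--no-gpu"])
print(args.engine, args.output, args.no_gpu)
```

Using `choices` makes `argparse` reject unknown engine names with a usage message before any heavy OCR library is loaded.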
├── engines/
│ ├── docling_engine.py # Docling implementation
│ ├── paddle_engine.py # PaddleOCR implementation
│ └── hybrid_engine.py # Image + Docling hybrid implementation
├── utils/
│ └── text_processing.py # Cleaning & post-processing logic
├── run.py # Main entry point CLI
├── README.md # Documentation
└── requirements.txt # Project dependencies
📧 Email: hassanaiengineer@gmail.com
🔗 LinkedIn: Hassan Khan
🔗 Upwork: Hassan Khan on Upwork
Made with ❤️ by Hassan Khan

