OCR PDF Extractor

This is a Python-based application for extracting text from scanned PDF files. It uses Tesseract OCR and Poppler to process each page, and outputs the recognized text to a .txt file. The tool includes a simple file selector and displays progress in the terminal. It's also packaged as a standalone .exe using PyInstaller, so it can run on any Windows machine without requiring Python or additional installations.

Features

Converts scanned PDF pages to images using Poppler
Extracts text using Tesseract OCR (English and Spanish)
Displays progress using a terminal progress bar
Saves the extracted text to texto_extraido.txt
Automatically opens the text file after processing
Can be compiled into a single-file .exe

Requirements (for development)

Python 3.8 or later
Tesseract OCR (installed locally)
Poppler for Windows
Python packages: pytesseract, pdf2image, tqdm
spa.traineddata file for Spanish OCR

⚠️ Language Support

This app uses Tesseract OCR with support for both English and Spanish.
Make sure your Tesseract installation includes the following language data files:

eng.traineddata (included by default)
spa.traineddata (must be downloaded manually if not present)

To install Spanish language support, download spa.traineddata from the official repo:
https://github.com/tesseract-ocr/tessdata

Place the file inside the tessdata folder of your Tesseract installation directory.
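
A quick way to confirm that both language files are in place is to check the tessdata folder before running OCR. The following is a minimal sketch and is not part of ocr_pdf.py; the function name missing_languages and the example path are assumptions made for illustration.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages=("eng", "spa")):
    """Return the language codes whose .traineddata file is absent.

    `tessdata_dir` is the tessdata folder of the Tesseract installation.
    """
    folder = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]


# Example call (the path is an assumption; use your own installation folder):
# print(missing_languages(r"C:\Tesseract-OCR\tessdata"))
```

An empty list means both files were found; any code in the returned list still needs its .traineddata file copied into the folder.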

Installation Steps

1. Install Tesseract OCR

Download the Windows installer from:
https://github.com/UB-Mannheim/tesseract/wiki
Install it to:
C:\Tesseract-OCR or a similar folder
Copy the full path to tesseract.exe for later use
Make sure the tessdata/ folder includes spa.traineddata (for Spanish).

2. Install Poppler for Windows

Download from:
https://github.com/oschwartz10612/poppler-windows/releases/
Extract it to a folder such as:
C:\poppler
The path to use in code is:
C:\poppler\Library\bin

3. Install Python dependencies

pip install pytesseract pdf2image tqdm
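
The two paths noted in steps 1 and 2 are the ones the Python libraries need at runtime. The sketch below shows one way to pass them along, assuming the example install locations used above; the constant and function names are hypothetical and may differ from what ocr_pdf.py does.

```python
from pathlib import PureWindowsPath

# Example locations from the steps above; adjust to your own machine.
TESSERACT_DIR = PureWindowsPath(r"C:\Tesseract-OCR")
TESSERACT_EXE = TESSERACT_DIR / "tesseract.exe"
POPPLER_BIN = PureWindowsPath(r"C:\poppler\Library\bin")


def configure_tesseract(exe=TESSERACT_EXE):
    """Tell pytesseract where tesseract.exe lives."""
    import pytesseract  # imported here so the constants load without it

    pytesseract.pytesseract.tesseract_cmd = str(exe)


def load_pages(pdf_path, poppler_bin=POPPLER_BIN):
    """Render each PDF page to an image using the Poppler binaries."""
    from pdf2image import convert_from_path

    return convert_from_path(str(pdf_path), poppler_path=str(poppler_bin))
```

Setting tesseract_cmd and passing poppler_path explicitly avoids having to add either folder to the system PATH.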

Running the App (from Python)

To use the application from source:

Make sure Python and all dependencies are installed.
Open a terminal in the project directory.
Run the script:

python ocr_pdf.py

A file picker window will appear. Select a scanned PDF file.
The extracted text will be saved to texto_extraido.txt and opened automatically.
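
For readers who want a picture of what happens between selecting the PDF and opening the text file, here is a rough sketch of that flow. It is an illustration under assumptions, not the contents of ocr_pdf.py: the function names, the use of tkinter for the file picker, and the page separator format are all hypothetical.

```python
import os
from typing import Callable, Iterable

OUTPUT_FILE = "texto_extraido.txt"  # output name used by this project


def extract_text(pages: Iterable, ocr: Callable[[object], str]) -> str:
    """Run `ocr` on every page and join the results.

    The "--- Page N ---" separator is an assumption for readability.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        chunks.append(f"--- Page {number} ---\n{ocr(page).strip()}\n")
    return "\n".join(chunks)


def pick_pdf() -> str:
    """Show a file picker (tkinter is assumed) and return the chosen path."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window
    path = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])
    root.destroy()
    return path


def run(pdf_path: str, poppler_bin: str, lang: str = "spa") -> str:
    """Convert the PDF to images, OCR each page, and save the text."""
    # Third-party imports are local so extract_text works without them.
    import pytesseract
    from pdf2image import convert_from_path
    from tqdm import tqdm

    pages = convert_from_path(pdf_path, poppler_path=poppler_bin)
    text = extract_text(
        tqdm(pages, desc="OCR"),  # terminal progress bar
        lambda image: pytesseract.image_to_string(image, lang=lang),
    )
    with open(OUTPUT_FILE, "w", encoding="utf-8") as handle:
        handle.write(text)
    if hasattr(os, "startfile"):  # os.startfile exists only on Windows
        os.startfile(OUTPUT_FILE)
    return OUTPUT_FILE
```

Passing the OCR step in as a callable keeps the page loop independent of Tesseract, so it can be checked with a stand-in function.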

Building the Executable (.exe)

You can compile the project as a portable .exe using PyInstaller.

Project Structure Required

Your project folder should look like this:

ocr_pdf/
├── ocr_pdf.py
├── README.md
├── requirements.txt
├── Tesseract-OCR/
│   └── tesseract.exe, tessdata/, etc.
└── poppler/
    └── Library/
        └── bin/
            └── pdfinfo.exe, other DLLs...

Run PyInstaller

Inside the ocr_pdf folder, run:

pyinstaller --onefile ^
  --add-data "Tesseract-OCR;Tesseract-OCR" ^
  --add-data "poppler\Library\bin;poppler\Library\bin" ^
  ocr_pdf.py

Notes: Each --add-data argument copies a folder into the bundle (in the form source;destination), so the Tesseract and Poppler binaries ship inside the executable. The resulting executable is created in the dist/ folder as ocr_pdf.exe.
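
A --onefile executable unpacks its bundled data into a temporary folder at startup, and PyInstaller exposes that folder as sys._MEIPASS. The script therefore has to look for Tesseract and Poppler relative to that folder when frozen. How ocr_pdf.py handles this is not shown here; the helper below is a hypothetical sketch, and its fallback to the current working directory assumes the script is run from the project folder as described earlier.

```python
import sys
from pathlib import Path


def resource_path(relative: str) -> Path:
    """Resolve a bundled file or folder in both frozen and source runs.

    Frozen (--onefile): PyInstaller sets sys._MEIPASS to the unpack folder.
    From source: fall back to the current working directory (assumption).
    """
    base = getattr(sys, "_MEIPASS", None)
    return Path(base if base else Path.cwd()) / relative


# These names match the destinations given to --add-data above.
tesseract_exe = resource_path("Tesseract-OCR") / "tesseract.exe"
poppler_bin = resource_path("poppler/Library/bin")
```

The resulting paths can then be handed to pytesseract and pdf2image instead of fixed locations such as C:\Tesseract-OCR.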

Distributing the Executable

You can share the .exe file from dist/ directly. The end user can:

Double-click to open the application.
Select a PDF file.
Receive the extracted text in a plain .txt file, opened automatically.

No Python, Tesseract, or Poppler installations are needed on the target machine.

This application was developed for a freelance client who needed to extract Spanish-language text from scanned PDF documents. The final product is a self-contained .exe that works on any Windows machine and outputs the OCR results to a text file with no installation required.
