An end-to-end OCR pipeline to extract, digitize, and structure voter information from scanned Indian electoral roll PDFs.
Electoral rolls in India are distributed as scanned PDF documents, making bulk data extraction a slow and manual process. This project automates that entirely: it ingests a raw electoral roll PDF, detects individual voter card regions on each page using computer vision, runs OCR over each card, and exports all structured voter records into a clean Excel workbook.
The pipeline is designed to closely mirror an exploratory notebook workflow while being modular, maintainable, and easy to run from the command line.
The pipeline accepts scanned Indian electoral roll PDFs (or pre-converted page images). Each page should follow the standard Election Commission of India layout:
- Page header: Constituency name/number, section name, and part number printed at the top.
- Voter card grid: Each page contains a 3-column grid of individual voter cards.
- Per-card layout: Every voter card contains:

| Field | Example |
|---|---|
| Serial Number | `7` (top-left corner of the box) |
| EPIC Number | `IPD0100594` (top-right of the box) |
| Name | `Name : PALANI ALIAS PALANIVEL` |
| Relative's Name | `Fathers Name: MANI` |
| House Number | `House Number : 4` |
| Age & Gender | `Age : 53 Gender : Male` |
| Photo Placeholder | A rectangular box labelled `Photo Available` |

Pages that are cover pages or constituency headers (containing no voter records) can be skipped using the `--skip-front-pages` flag.
- PDF to image conversion: Automatically converts electoral roll PDFs to page images using Poppler
- Intelligent voter box detection: Detects individual voter card bounding boxes per page (width > 400px, height > 120px) using OpenCV contour analysis
- OCR with Tesseract: Extracts raw text from each voter card crop using `--psm 6` (uniform block mode)
- EPIC number extraction: Parses EPIC numbers from header crops with built-in OCR correction (3 letters + 7 digits pattern), with fallback to the first OCR line
- Structured field extraction: Pulls the following fields from every voter record:
  - Serial Number
  - EPIC Number
  - Voter Name
  - Relative Name & Relation Type
  - House Number
  - Age
  - Gender
- Excel export: Outputs a formatted `.xlsx` workbook to `output/voter_output.xlsx`
- Numeric page sorting: Correctly sorts `page_*.jpg` files numerically, not lexicographically
- Configurable front-page skipping: Optionally skip cover/header pages that don't contain voter records
- Auto-save during processing: Saves progress after every N pages to guard against interruptions
- Startup cleanup: Automatically removes legacy output files from older versions on launch
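
The numeric page sorting listed above matters because a plain string sort places `page_10.jpg` before `page_2.jpg`. A minimal sketch of one way to do it follows; the helper names `page_number` and `sort_pages` are hypothetical and not taken from the project's code.

```python
import re
from pathlib import Path

# Hypothetical helper pattern: matches names such as "page_12.jpg".
_PAGE_RE = re.compile(r"page_(\d+)\.jpg$", re.IGNORECASE)


def page_number(path):
    """Return the integer page index, or None if the name does not match."""
    match = _PAGE_RE.search(Path(path).name)
    return int(match.group(1)) if match else None


def sort_pages(paths):
    """Sort page images by numeric index; non-matching names are dropped."""
    numbered = [(page_number(p), str(p)) for p in paths]
    return [p for n, p in sorted(x for x in numbered if x[0] is not None)]


if __name__ == "__main__":
    names = ["page_10.jpg", "page_2.jpg", "page_1.jpg"]
    print(sorted(names))      # lexicographic order: page_1, page_10, page_2
    print(sort_pages(names))  # numeric order: page_1, page_2, page_10
```

Converting the captured digits to `int` also makes zero-padded names such as `page_007.jpg` sort correctly alongside unpadded ones.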

    electoral-roll-ocr/
    │
    ├── main.py                  # Entry point: runs the full end-to-end extraction pipeline
    │
    ├── pipeline/
    │   ├── image_loader.py      # Loads page images; handles PDF-to-image conversion
    │   ├── preprocessing.py     # Page thresholding and voter card box detection
    │   ├── ocr_engine.py        # Tesseract OCR helpers and EPIC number extraction logic
    │   ├── parser.py            # Parses raw OCR text into structured voter fields
    │   └── exporter.py          # Writes the final formatted Excel workbook
    │
    ├── frontend/                # Web frontend for uploading PDFs and viewing results
    │
    ├── api.py                   # REST API layer (connects frontend to the pipeline)
    ├── app.py                   # App server entry point
    │
    ├── images/                  # Page images (auto-generated from PDF, or supplied manually)
    ├── output/                  # Output directory: voter_output.xlsx is saved here
    │
    ├── sample_data.pdf          # Sample electoral roll PDF for testing
    ├── requirements.txt         # Python dependencies
    └── package.json             # Frontend dependencies

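
To make the EPIC number extraction handled in `ocr_engine.py` more concrete, here is a minimal sketch of how a "3 letters + 7 digits" pattern with OCR correction and a first-line fallback could work. The look-alike character tables and the function names `normalize_epic` and `extract_epic` are assumptions for illustration; the project's actual rules may differ.

```python
import re

# Assumed look-alike tables; the project's real correction rules may differ.
TO_LETTER = str.maketrans("01258", "OIZSB")       # digits misread in letter slots
TO_DIGIT = str.maketrans("OQDILZSB", "00011258")  # letters misread in digit slots

EPIC_RE = re.compile(r"^[A-Z]{3}\d{7}$")


def normalize_epic(token):
    """Coerce a 10-character token into the 3-letter + 7-digit shape, or None."""
    token = re.sub(r"[^A-Za-z0-9]", "", token).upper()
    if len(token) != 10:
        return None
    candidate = token[:3].translate(TO_LETTER) + token[3:].translate(TO_DIGIT)
    return candidate if EPIC_RE.match(candidate) else None


def extract_epic(header_text):
    """Return the first valid EPIC in the text, else the first non-empty line."""
    for token in header_text.split():
        epic = normalize_epic(token)
        if epic:
            return epic
    # Fallback: hand back the first OCR line unchanged so it can be reviewed.
    lines = [ln.strip() for ln in header_text.splitlines() if ln.strip()]
    return lines[0] if lines else ""
```

For example, `normalize_epic("1PDO1OO594")` repairs the misread characters and yields `IPD0100594`, the sample EPIC number shown earlier.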
Make sure the following are installed on your system before proceeding:
| Dependency | Purpose | Install |
|---|---|---|
| Python 3.8+ | Core runtime | python.org |
| Tesseract OCR | Text recognition engine | Installation guide |
| Poppler | PDF-to-image conversion | `apt install poppler-utils` / Windows builds |
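
Before running the pipeline, it can help to confirm that the external tools are discoverable. The following is a rough sketch using only the standard library; it is not part of the project. It assumes the Tesseract binary is named `tesseract`, checks for `pdftoppm` (one of the Poppler command-line utilities), and honours the `TESSERACT_CMD` environment variable that this project reads. The function names are hypothetical.

```python
import os
import shutil


def find_tool(name, env_var=None):
    """Return a path for `name` from env_var (if set) or PATH, else None."""
    if env_var:
        override = os.environ.get(env_var)
        if override:
            return override
    return shutil.which(name)


def check_prerequisites():
    """Map each external tool to its resolved path (None when missing)."""
    return {
        "tesseract": find_tool("tesseract", env_var="TESSERACT_CMD"),
        # pdftoppm is one of the Poppler command-line utilities.
        "pdftoppm": find_tool("pdftoppm"),
    }


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Note that the sketch trusts the environment variable without verifying that the file exists; a stricter check could add `os.path.isfile`.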

    git clone https://github.com/DeekshaR06/electoral-roll-ocr.git
    cd electoral-roll-ocr
    pip install -r requirements.txt

To run the pipeline on the bundled sample PDF:

    python main.py --pdf sample_data.pdf

This converts the PDF to page images, processes every page, and exports results to `output/voter_output.xlsx`.

If you have already converted the PDF to images (`page_001.jpg`, `page_002.jpg`, ...) and placed them in the `images/` directory:

    python main.py

    # Skip the first 2 pages (e.g. cover page, constituency header)
    python main.py --pdf your_roll.pdf --skip-front-pages 2

    # Specify custom image and output directories
    python main.py --pdf your_roll.pdf --images images --output output

    # Auto-save progress after every page (useful for large PDFs)
    python main.py --pdf your_roll.pdf --autosave-every-pages 1

    # Use the bundled sample PDF if no images are found
    python main.py --use-sample-pdf-if-empty

If Tesseract is not on your system PATH, point to it manually before running (PowerShell syntax shown):

    $env:TESSERACT_CMD = "C:\Program Files\Tesseract-OCR\tesseract.exe"
    python main.py --pdf sample_data.pdf

After a successful run, `output/voter_output.xlsx` will contain one row per voter with the following columns:

| Column | Description |
|---|---|
| `Serial Number` | Voter's serial number on the roll |
| `EPIC Number` | Unique Elector's Photo Identity Card number |
| `Name` | Voter's full name |
| `Relative Name` | Name of father / mother / spouse |
| `Relation Type` | Relation (Father / Mother / Husband / Wife) |
| `House Number` | Residential house number |
| `Age` | Voter's age |
| `Gender` | Male / Female / Other |
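
As an illustration of how raw card text might be mapped onto these columns, the sketch below parses the labelled lines with regular expressions. The patterns are assumptions modelled on the sample card text in this README, and `parse_card` is a hypothetical name; the project's `parser.py` may work differently. Serial Number and EPIC Number are omitted because they are read from the corners of the card rather than from labelled lines.

```python
import re

# Assumed label patterns, modelled on the sample card text in this README.
NAME_RE = re.compile(r"^[ \t]*Name[ \t]*[:;][ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)
RELATIVE_RE = re.compile(
    r"^[ \t]*(Father|Mother|Husband|Wife)'?s?[ \t]+Name[ \t]*[:;][ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)
HOUSE_RE = re.compile(r"House[ \t]+Number[ \t]*[:;][ \t]*(\S+)", re.IGNORECASE)
AGE_RE = re.compile(r"\bAge[ \t]*[:;][ \t]*(\d{1,3})", re.IGNORECASE)
GENDER_RE = re.compile(r"Gender[ \t]*[:;][ \t]*(Male|Female|Other)", re.IGNORECASE)


def parse_card(text):
    """Turn one card's OCR text into a dict keyed by the output column names."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(1).strip() if match else ""

    relative = RELATIVE_RE.search(text)
    age = first(AGE_RE)
    return {
        "Name": first(NAME_RE),
        "Relative Name": relative.group(2).strip() if relative else "",
        "Relation Type": relative.group(1).capitalize() if relative else "",
        "House Number": first(HOUSE_RE),
        "Age": int(age) if age else None,  # None when OCR missed the age
        "Gender": first(GENDER_RE).capitalize(),
    }
```

Missing fields come back as empty strings (or `None` for the age) so that a partially read card still yields a row for later review.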
| Library | Role |
|---|---|
| OpenCV | Page preprocessing and voter box detection |
| Tesseract / pytesseract | OCR text extraction |
| pdf2image / Poppler | PDF to page image conversion |
| Pandas | Data structuring and manipulation |
| openpyxl | Formatted Excel workbook export |
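
To illustrate the voter box detection step, the sketch below applies the size thresholds stated in this README (width > 400px, height > 120px) to candidate bounding boxes and orders the survivors row by row. It assumes boxes arrive as `(x, y, w, h)` tuples, the shape returned by OpenCV's `cv2.boundingRect`; the function name and the `row_tolerance` parameter are hypothetical, and the contour search itself is left out.

```python
MIN_WIDTH = 400   # px, threshold stated in this README
MIN_HEIGHT = 120  # px, threshold stated in this README


def filter_and_order_boxes(boxes, row_tolerance=40):
    """Keep card-sized boxes and return them top-to-bottom, left-to-right.

    boxes: iterable of (x, y, w, h) tuples.
    row_tolerance: assumed max vertical offset (px) between boxes in one row.
    """
    kept = [b for b in boxes if b[2] > MIN_WIDTH and b[3] > MIN_HEIGHT]
    kept.sort(key=lambda b: b[1])  # sort by top edge first

    rows = []
    for box in kept:
        # Join the current row if the top edge is close to the row's first box;
        # otherwise start a new row.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    # Within each row, order by left edge.
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]
```

Grouping into rows before sorting by `x` keeps the output in reading order even when boxes in the same row differ by a few pixels vertically, which is common in scanned pages.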
Contributions are welcome! To get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -m 'Add your feature'`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
Please open an issue first for significant changes or new features so we can discuss the approach.
This project is licensed under the MIT License. See the LICENSE file for details.
| Deeksha R | Samudyatha K Bhat |
