AI-powered Invoice/Receipt Parser using OCR and Document Understanding.
- Multiple OCR Engines: Support for PaddleOCR and EasyOCR
- Document Understanding: Optional LayoutLMv3 integration
- Structured Extraction: Extract vendor info, dates, amounts, line items
- REST API: FastAPI backend for integration
- Web Interface: Streamlit app for easy use
- Multi-language: Support for English, Vietnamese, and more
invoice-parser/
├── src/                              # Source code
│   ├── __init__.py                   # Package initialization
│   ├── core/                         # Core functionality
│   │   ├── __init__.py
│   │   ├── config.py                 # Configuration management
│   │   ├── logger.py                 # Centralized logging
│   │   └── exceptions.py             # Custom exceptions
│   ├── models/                       # Data models & schemas
│   │   ├── __init__.py
│   │   └── schemas.py                # Pydantic models
│   ├── ocr/                          # OCR engines
│   │   ├── __init__.py
│   │   ├── base.py                   # Base OCR class
│   │   ├── paddleocr_engine.py       # PaddleOCR implementation
│   │   ├── easyocr_engine.py         # EasyOCR implementation
│   │   └── ocr_factory.py            # OCR factory pattern
│   ├── extraction/                   # Field extraction
│   │   ├── __init__.py
│   │   ├── invoice_extractor.py      # Main extractor
│   │   ├── field_extractor.py        # Rule-based extraction
│   │   └── layoutlm_extractor.py     # LayoutLM-based extraction
│   ├── processing/                   # Image/PDF processing
│   │   ├── __init__.py
│   │   ├── preprocessor.py           # Image preprocessing
│   │   ├── postprocessor.py          # Text postprocessing
│   │   └── pdf_handler.py            # PDF handling
│   ├── utils/                        # Helper utilities
│   │   ├── __init__.py
│   │   ├── file_utils.py             # File operations
│   │   ├── text_utils.py             # Text processing
│   │   └── image_utils.py            # Image operations
│   ├── api/                          # FastAPI REST API
│   │   ├── __init__.py
│   │   ├── app.py                    # FastAPI app
│   │   └── routes.py                 # API routes
│   └── web/                          # Web interface
│       ├── __init__.py
│       └── streamlit_app.py          # Streamlit application
├── app/                              # (Legacy) Old streamlit location
├── config/
│   └── config.yaml                   # Configuration file
├── data/                             # Data directories
│   ├── uploads/                      # Uploaded files
│   ├── outputs/                      # Processed outputs
│   └── temp/                         # Temporary files
├── tests/                            # Unit tests
│   ├── conftest.py
│   ├── test_api.py
│   ├── test_extraction.py
│   └── test_ocr.py
├── .gitignore
├── requirements.txt
├── pyproject.toml
├── run_api.py                        # Run FastAPI server
└── run_app.py                        # Run Streamlit app
# Clone repository
cd invoice-parser
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# Run the Streamlit web interface
python run_app.py
Open http://localhost:8501 in your browser.

# Run the FastAPI server
python run_api.py
API documentation is available at http://localhost:8000/docs
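
Once the API server is running, the `/api/v1/parse` endpoint (shown with curl further down) can also be called from Python. The following is a minimal sketch using only the standard library; the helper names `build_multipart` and `parse_invoice` are hypothetical and not part of this project, and the form field name `file` and the base URL are taken from the curl example.

```python
import json
import mimetypes
import os
import uuid
from typing import Tuple
from urllib import request


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns (body, content_type). Hypothetical helper, not part of the project.
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def parse_invoice(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a file to /api/v1/parse and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    req = request.Request(
        f"{base_url}/api/v1/parse",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `parse_invoice("invoice.jpg")["data"]["total"]` should give the parsed total, going by the response example in this README. A third-party HTTP client would shorten the upload code; the standard library is used here only to keep the sketch dependency-free.
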
from src.extraction import InvoiceExtractor
# Initialize extractor
extractor = InvoiceExtractor(
    ocr_engine="paddleocr",
    languages=["en", "vi"],
    use_gpu=False,
)
# Extract from image
result = extractor.extract("path/to/invoice.jpg")
# Access extracted data
print(f"Invoice #: {result.invoice_number}")
print(f"Date: {result.invoice_date}")
print(f"Total: {result.currency} {result.total}")
print(f"Vendor: {result.vendor_name}")
# Export to JSON
print(result.to_json())

# Parse single invoice
curl -X POST "http://localhost:8000/api/v1/parse" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@invoice.jpg"

Response:
{
  "success": true,
  "message": "Invoice parsed successfully",
  "data": {
    "vendor_name": "ACME Corporation",
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "due_date": "2024-02-15",
    "subtotal": 1000.00,
    "tax_amount": 100.00,
    "total": 1100.00,
    "currency": "USD",
    "line_items": [
      {
        "description": "Product A",
        "quantity": 10,
        "unit_price": 100.00,
        "amount": 1000.00
      }
    ],
    "confidence_score": 0.85
  },
  "processing_time": 1.234
}

Edit config/config.yaml:
ocr:
  engine: "paddleocr"  # or "easyocr"
  language: ["en", "vi"]
  use_gpu: false
document_ai:
  model_name: "microsoft/layoutlmv3-base"
  use_gpu: false
api:
  host: "0.0.0.0"
  port: 8000
  max_file_size_mb: 10

# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html

from src.ocr.base import BaseOCR, OCRResult
from src.ocr import OCRFactory


class MyOCREngine(BaseOCR):
    def _initialize_model(self):
        # Initialize your model
        pass

    def _process_image(self, image):
        # Process image and return OCRResult
        pass


# Register engine
OCRFactory.register_engine("myocr", MyOCREngine)

from src.extraction.field_extractor import FieldExtractor, ExtractionRule
extractor = FieldExtractor()
extractor.add_rule(ExtractionRule(
    name="po_number",
    patterns=[
        r"PO[\s#:]*([A-Z0-9\-]+)",
        r"Purchase Order[\s#:]*([A-Z0-9\-]+)",
    ],
))

- OCR: PaddleOCR, EasyOCR
- Document AI: LayoutLMv3 (HuggingFace Transformers)
- API: FastAPI, Pydantic
- Web UI: Streamlit
- Image Processing: OpenCV, Pillow
- PDF: pdf2image, PyPDF2
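
Before registering a custom rule such as the `po_number` example above, it can help to try its patterns against sample text with the standard `re` module. The sketch below does not use `FieldExtractor`. It assumes, without confirmation from the project source, that patterns are tried in order and that the first capture group of the first match is taken. The helper name `first_match` is hypothetical.

```python
import re
from typing import Iterable, Optional

# Patterns copied from the po_number rule in this README.
PO_PATTERNS = [
    r"PO[\s#:]*([A-Z0-9\-]+)",
    r"Purchase Order[\s#:]*([A-Z0-9\-]+)",
]


def first_match(patterns: Iterable[str], text: str) -> Optional[str]:
    """Return capture group 1 of the first pattern that matches, else None.

    Assumption: this mirrors how a rule-based extractor might apply patterns;
    the project's actual matching logic may differ.
    """
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


print(first_match(PO_PATTERNS, "PO #: PO-2024-0042"))     # PO-2024-0042
print(first_match(PO_PATTERNS, "Purchase Order: 12345"))  # 12345
print(first_match(PO_PATTERNS, "Ship to: PO Box 7"))      # B (false positive)
```

The last call shows why testing matters: the patterns are unanchored and case-sensitive, so "PO Box" yields a spurious match. Adding a word boundary (`\b`) or requiring at least one digit in the capture group are possible ways to tighten the rule.
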
- Fine-tune LayoutLMv3 on custom dataset
- Add support for more document types (receipts, bills)
- Multi-page document support
- Table extraction enhancement
- Cloud deployment (Docker, Kubernetes)
- Integration with accounting software
MIT License
Contributions are welcome! Please read the contributing guidelines first.