AI-powered invoice and receipt parser that extracts structured data (vendor info, dates, amounts, line items) from document images using OCR engines (PaddleOCR, EasyOCR) and document understanding models (LayoutLMv3). Features REST API (FastAPI) and web interface (Streamlit).

manhsontran/invoice-parser

Invoice Parser

AI-powered Invoice/Receipt Parser using OCR and Document Understanding.

🎯 Features

  • Multiple OCR Engines: Support for PaddleOCR and EasyOCR
  • Document Understanding: Optional LayoutLMv3 integration
  • Structured Extraction: Extract vendor info, dates, amounts, line items
  • REST API: FastAPI backend for integration
  • Web Interface: Streamlit app for easy use
  • Multi-language: Support for English, Vietnamese, and more

📁 Project Structure

invoice-parser/
├── src/                          # Source code
│   ├── __init__.py               # Package initialization
│   ├── core/                     # Core functionality
│   │   ├── __init__.py
│   │   ├── config.py             # Configuration management
│   │   ├── logger.py             # Centralized logging
│   │   └── exceptions.py         # Custom exceptions
│   ├── models/                   # Data models & schemas
│   │   ├── __init__.py
│   │   └── schemas.py            # Pydantic models
│   ├── ocr/                      # OCR engines
│   │   ├── __init__.py
│   │   ├── base.py               # Base OCR class
│   │   ├── paddleocr_engine.py   # PaddleOCR implementation
│   │   ├── easyocr_engine.py     # EasyOCR implementation
│   │   └── ocr_factory.py        # OCR factory pattern
│   ├── extraction/               # Field extraction
│   │   ├── __init__.py
│   │   ├── invoice_extractor.py  # Main extractor
│   │   ├── field_extractor.py    # Rule-based extraction
│   │   └── layoutlm_extractor.py # LayoutLM-based extraction
│   ├── processing/               # Image/PDF processing
│   │   ├── __init__.py
│   │   ├── preprocessor.py       # Image preprocessing
│   │   ├── postprocessor.py      # Text postprocessing
│   │   └── pdf_handler.py        # PDF handling
│   ├── utils/                    # Helper utilities
│   │   ├── __init__.py
│   │   ├── file_utils.py         # File operations
│   │   ├── text_utils.py         # Text processing
│   │   └── image_utils.py        # Image operations
│   ├── api/                      # FastAPI REST API
│   │   ├── __init__.py
│   │   ├── app.py                # FastAPI app
│   │   └── routes.py             # API routes
│   └── web/                      # Web interface
│       ├── __init__.py
│       └── streamlit_app.py      # Streamlit application
├── app/                          # (Legacy) Old streamlit location
├── config/
│   └── config.yaml               # Configuration file
├── data/                         # Data directories
│   ├── uploads/                  # Uploaded files
│   ├── outputs/                  # Processed outputs
│   └── temp/                     # Temporary files
├── tests/                        # Unit tests
│   ├── conftest.py
│   ├── test_api.py
│   ├── test_extraction.py
│   └── test_ocr.py
├── .gitignore
├── requirements.txt
├── pyproject.toml
├── run_api.py                    # Run FastAPI server
└── run_app.py                    # Run Streamlit app

🚀 Quick Start

1. Installation

# Clone repository
git clone https://github.com/manhsontran/invoice-parser.git
cd invoice-parser

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Run Web Interface (Streamlit)

python run_app.py

Open http://localhost:8501 in your browser.

3. Run API Server

python run_api.py

Interactive API documentation is available at http://localhost:8000/docs once the server is running.

📖 Usage

Python API

from src.extraction import InvoiceExtractor

# Initialize extractor
extractor = InvoiceExtractor(
    ocr_engine="paddleocr",
    languages=["en", "vi"],
    use_gpu=False,
)

# Extract from image
result = extractor.extract("path/to/invoice.jpg")

# Access extracted data
print(f"Invoice #: {result.invoice_number}")
print(f"Date: {result.invoice_date}")
print(f"Total: {result.currency} {result.total}")
print(f"Vendor: {result.vendor_name}")

# Export to JSON
print(result.to_json())

REST API

# Parse single invoice
curl -X POST "http://localhost:8000/api/v1/parse" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@invoice.jpg"
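The same request can be sent from Python. The following is a minimal sketch using only the standard library; the endpoint URL and the `file` form field come from the curl command above, while the helper names `build_multipart` and `parse_invoice` are hypothetical and not part of this project.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns (body, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def parse_invoice(path: str, url: str = "http://localhost:8000/api/v1/parse") -> dict:
    """POST the file at `path` to the parse endpoint and return the decoded JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, `parse_invoice("invoice.jpg")` should return a dictionary shaped like the response documented in this README.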

Response Format

{
  "success": true,
  "message": "Invoice parsed successfully",
  "data": {
    "vendor_name": "ACME Corporation",
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "due_date": "2024-02-15",
    "subtotal": 1000.00,
    "tax_amount": 100.00,
    "total": 1100.00,
    "currency": "USD",
    "line_items": [
      {
        "description": "Product A",
        "quantity": 10,
        "unit_price": 100.00,
        "amount": 1000.00
      }
    ],
    "confidence_score": 0.85
  },
  "processing_time": 1.234
}
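Because OCR output can be noisy, a client may want to sanity-check the amounts before trusting them. The sketch below is one possible check, not something the API does itself: `check_totals` is a hypothetical helper, the tolerance is an arbitrary choice, and it assumes every numeric field shown above is present in `data`.

```python
import json
import math

# The sample response from this README, trimmed to the fields used below.
SAMPLE_RESPONSE = """
{
  "success": true,
  "data": {
    "subtotal": 1000.00,
    "tax_amount": 100.00,
    "total": 1100.00,
    "line_items": [
      {"description": "Product A", "quantity": 10, "unit_price": 100.00, "amount": 1000.00}
    ]
  }
}
"""


def check_totals(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies (empty if the amounts add up)."""
    problems = []
    for item in data["line_items"]:
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["amount"], abs_tol=tolerance):
            problems.append(f"line item {item['description']!r}: amount mismatch")
    items_sum = sum(item["amount"] for item in data["line_items"])
    if not math.isclose(items_sum, data["subtotal"], abs_tol=tolerance):
        problems.append("line items do not sum to subtotal")
    if not math.isclose(data["subtotal"] + data["tax_amount"], data["total"], abs_tol=tolerance):
        problems.append("subtotal + tax_amount does not equal total")
    return problems


payload = json.loads(SAMPLE_RESPONSE)
if payload["success"]:
    issues = check_totals(payload["data"])
    print("consistent" if not issues else issues)
```

A non-empty result does not necessarily mean the parse failed; invoices with discounts or shipping charges would need additional rules.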

⚙️ Configuration

Edit config/config.yaml:

ocr:
  engine: "paddleocr"  # or "easyocr"
  language: ["en", "vi"]
  use_gpu: false

document_ai:
  model_name: "microsoft/layoutlmv3-base"
  use_gpu: false

api:
  host: "0.0.0.0"
  port: 8000
  max_file_size_mb: 10
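When overriding only a few keys, it can help to merge a partial configuration with defaults and validate it before starting the server. The sketch below is illustrative and is not the logic in `src/core/config.py`: it assumes the values shown above are the defaults, operates on a plain dictionary (such as the result of PyYAML's `yaml.safe_load`), and the names `merge_with_defaults` and `validate` are hypothetical.

```python
import copy

# Assumed defaults, copied from the sample config.yaml above.
DEFAULTS = {
    "ocr": {"engine": "paddleocr", "language": ["en", "vi"], "use_gpu": False},
    "document_ai": {"model_name": "microsoft/layoutlmv3-base", "use_gpu": False},
    "api": {"host": "0.0.0.0", "port": 8000, "max_file_size_mb": 10},
}
SUPPORTED_ENGINES = {"paddleocr", "easyocr"}


def merge_with_defaults(raw: dict) -> dict:
    """Overlay user-supplied sections on a deep copy of DEFAULTS."""
    merged = copy.deepcopy(DEFAULTS)
    for section, values in raw.items():
        merged.setdefault(section, {}).update(values)
    return merged


def validate(config: dict) -> None:
    """Raise ValueError if a setting is outside the range this sketch accepts."""
    engine = config["ocr"]["engine"]
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported OCR engine: {engine!r}")
    port = config["api"]["port"]
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if config["api"]["max_file_size_mb"] <= 0:
        raise ValueError("max_file_size_mb must be positive")


config = merge_with_defaults({"ocr": {"engine": "easyocr"}})
validate(config)
print(config["ocr"])
```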

🧪 Testing

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

🛠️ Development

Adding Custom OCR Engine

from src.ocr.base import BaseOCR, OCRResult
from src.ocr import OCRFactory

class MyOCREngine(BaseOCR):
    def _initialize_model(self):
        # Initialize your model
        pass
    
    def _process_image(self, image):
        # Process image and return OCRResult
        pass

# Register engine
OCRFactory.register_engine("myocr", MyOCREngine)
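For readers unfamiliar with the pattern, a name-to-class registry like the one `OCRFactory.register_engine` suggests could work roughly as follows. This is a standalone sketch, not the code in `src/ocr/ocr_factory.py`: the classes `Engine`, `EngineRegistry`, and `EchoEngine`, and the `create` method, are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class Engine(ABC):
    """Hypothetical stand-in for an OCR engine base class."""

    @abstractmethod
    def recognize(self, image: bytes) -> str:
        """Return the text found in `image`."""


class EngineRegistry:
    """Maps engine names to classes so callers can select one by string."""

    _engines: dict[str, type[Engine]] = {}

    @classmethod
    def register_engine(cls, name: str, engine_cls: type) -> None:
        if not (isinstance(engine_cls, type) and issubclass(engine_cls, Engine)):
            raise TypeError(f"{engine_cls!r} is not an Engine subclass")
        cls._engines[name.lower()] = engine_cls

    @classmethod
    def create(cls, name: str, **kwargs) -> Engine:
        try:
            engine_cls = cls._engines[name.lower()]
        except KeyError:
            available = sorted(cls._engines)
            raise ValueError(f"unknown engine {name!r}; available: {available}") from None
        return engine_cls(**kwargs)


class EchoEngine(Engine):
    """Toy engine that treats the image bytes as UTF-8 text."""

    def recognize(self, image: bytes) -> str:
        return image.decode("utf-8")


EngineRegistry.register_engine("echo", EchoEngine)
print(EngineRegistry.create("echo").recognize(b"INV-2024-001"))
```

Keeping the lookup in one place means new engines can be added without touching the code that selects them by name.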

Adding Custom Extraction Rules

from src.extraction.field_extractor import FieldExtractor, ExtractionRule

extractor = FieldExtractor()
extractor.add_rule(ExtractionRule(
    name="po_number",
    patterns=[
        r"PO[\s#:]*([A-Z0-9\-]+)",
        r"Purchase Order[\s#:]*([A-Z0-9\-]+)",
    ],
))
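To see what these two patterns capture, they can be tried directly with Python's `re` module. The helper `first_match` below is hypothetical and only assumes that patterns are tried in order and the first capture group is returned; `FieldExtractor` may apply them differently.

```python
import re
from typing import Optional

PO_PATTERNS = [
    r"PO[\s#:]*([A-Z0-9\-]+)",
    r"Purchase Order[\s#:]*([A-Z0-9\-]+)",
]


def first_match(text: str, patterns: list[str]) -> Optional[str]:
    """Return group 1 of the first pattern that matches `text`, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


for line in ["PO #: 12345-AB", "Purchase Order: 7781", "po 123"]:
    print(repr(line), "->", first_match(line, PO_PATTERNS))
```

The patterns are case-sensitive as written, so lowercase `po 123` yields no match. Order also matters in this sketch: for `Purchase Order: PO-7781` the first pattern matches the later `PO` and captures `-7781`, so more specific patterns may be better placed first.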

📚 Technologies

  • OCR: PaddleOCR, EasyOCR
  • Document AI: LayoutLMv3 (HuggingFace Transformers)
  • API: FastAPI, Pydantic
  • Web UI: Streamlit
  • Image Processing: OpenCV, Pillow
  • PDF: pdf2image, PyPDF2

🗺️ Roadmap

  • Fine-tune LayoutLMv3 on custom dataset
  • Add support for more document types (receipts, bills)
  • Multi-page document support
  • Table extraction enhancement
  • Cloud deployment (Docker, Kubernetes)
  • Integration with accounting software

📄 License

MIT License

🀝 Contributing

Contributions are welcome! Please read the contributing guidelines first.
