Extract invoices and receipts from Gmail using Gemini AI

scailetech/openinvoice

Gmail Invoice & Receipt Extractor

A production-grade Python tool that fetches invoices and receipts from Gmail and extracts structured data from them using Google's Gemini AI.

Features

  • 📧 Gmail Integration: Fetches emails from any Gmail inbox using IMAP
  • 🔍 Smart Filtering: Filters emails by keywords (invoice, receipt, rechnung, quittung, etc.) in subject or body
  • 📎 Attachment Extraction: Automatically extracts all attachments from matching emails
  • 🤖 AI-Powered Extraction: Uses Gemini 2.0 Flash to extract structured invoice data from PDFs
  • ⚡ Parallel Processing: Processes invoices in parallel for fast extraction
  • 🔄 Resume Capability: Automatically resumes from where it left off
  • 🛡️ Robust Error Handling: Retry logic, comprehensive error handling, and detailed logging
  • 📊 Structured Output: Exports all extracted data to JSON with full metadata
  • ✅ Comprehensive Tests: Full test suite with 37 test cases covering all functionality

Quick Start

1. Prerequisites

  • Python 3.8+
  • Gmail account with 2-Step Verification enabled
  • Google App Password (for Gmail access)
  • Gemini API key (for invoice extraction)

2. Installation

# Clone the repository
git clone https://github.com/federicodeponte/gmail-invoice-extractor.git
cd gmail-invoice-extractor

# Install dependencies
pip install -r requirements.txt

# Install system dependencies for PDF processing (macOS)
brew install poppler

# Or on Ubuntu/Debian
sudo apt-get install poppler-utils

3. Configuration

Create a .env.local file (or copy from .env.example):

# Gmail credentials
GMAIL_EMAIL=your.email@gmail.com
GMAIL_APP_PASSWORD=your_16_char_app_password

# Gemini API key
GEMINI_API_KEY=your_gemini_api_key
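
This README does not show how the scripts read .env.local. As a minimal sketch, assuming a plain KEY=VALUE file like the one above, the values could be loaded with the standard library alone. The helper names `parse_env_text` and `load_env_file` are hypothetical and not taken from the repository.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env.local") -> dict:
    """Export values from the file without overriding variables already set."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After `load_env_file()`, the credentials would be available through `os.environ["GMAIL_EMAIL"]` and the other two keys. Keep .env.local out of version control, since it holds secrets.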

Getting a Gmail App Password:

  1. Go to https://myaccount.google.com/apppasswords
  2. Generate a new app password for "Mail"
  3. Copy the 16-character password

Getting a Gemini API Key:

  1. Go to https://makersuite.google.com/app/apikey
  2. Create a new API key
  3. Copy the key

4. Usage

Step 1: Fetch Emails and Extract Attachments

python3 fetch_emails.py

This will:

  • Connect to your Gmail inbox
  • Search for emails containing invoice/receipt keywords
  • Extract all attachments to email_attachments/ directory
  • Show progress and summary
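
As a rough illustration of the attachment step, the sketch below walks an already parsed message with the standard `email` package and writes each attachment to a directory. It is not the code in fetch_emails.py: the function name `save_attachments` is an assumption, and the IMAP connection itself is left out so the example runs offline.

```python
import os
from email.message import EmailMessage


def save_attachments(message: EmailMessage, output_dir: str) -> list:
    """Write every attachment of `message` into `output_dir`; return saved paths."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for part in message.iter_attachments():
        filename = part.get_filename()
        if not filename:
            continue  # Skip attachments without a usable name.
        # Keep only the base name so a crafted filename cannot leave output_dir.
        safe_name = os.path.basename(filename)
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        path = os.path.join(output_dir, safe_name)
        with open(path, "wb") as handle:
            handle.write(payload)
        saved.append(path)
    return saved
```

When parsing raw bytes fetched over IMAP, pass `policy=email.policy.default` to `email.message_from_bytes` so that the result is an `EmailMessage` and `iter_attachments()` is available.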

Filter Keywords (configurable in fetch_emails.py):

  • English: invoice, receipt, bill
  • German: rechnung, quittung, beleg, zahlungsbeleg, gutschrift, abrechnung, rechnungsbeleg

Step 2: Extract Invoice Data with Gemini

python3 extract_invoices_gemini.py

This will:

  • Process all PDF files in email_attachments/
  • Extract structured invoice data using Gemini AI
  • Save results to invoices_extracted.json
  • Show real-time progress

Output Format (invoices_extracted.json):

{
  "extraction_summary": {
    "total_invoices": 887,
    "successful_extractions": 886,
    "failed_extractions": 1,
    "extraction_date": "2025-11-07T14:30:00",
    "gemini_model": "gemini-2.0-flash",
    "processing_time_seconds": 1234.56,
    "parallel_workers": 50
  },
  "invoices": [
    {
      "success": true,
      "filename": "invoice_123.pdf",
      "file_path": "email_attachments/invoice_123.pdf",
      "invoice_data": {
        "vendor": "Company Name",
        "amount": 123.45,
        "currency": "EUR",
        "date": "2025-11-01",
        "invoice_number": "INV-2025-001",
        "description": "Services rendered",
        "tax_amount": 19.45,
        "tax_rate": 19.0,
        "items": [
          {
            "name": "Service Item",
            "quantity": 1,
            "price": 123.45,
            "total": 123.45
          }
        ]
      },
      "extraction_method": "gemini-2.0-flash",
      "extraction_timestamp": "2025-11-07T14:30:00",
      "extraction_time": 5.23,
      "image_conversion_time": 1.45,
      "gemini_api_time": 3.78,
      "file_size_bytes": 45678
    }
  ]
}
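
Because the output is plain JSON, it can be post-processed with the standard library. A minimal sketch, assuming the field names shown above, that totals amounts per vendor and currency (the function names `load_report` and `totals_by_vendor` are illustrative):

```python
import json
from collections import defaultdict


def load_report(path: str = "invoices_extracted.json") -> dict:
    """Read the extraction report written by extract_invoices_gemini.py."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def totals_by_vendor(report: dict) -> dict:
    """Sum `amount` per (vendor, currency) over successful extractions."""
    totals = defaultdict(float)
    for entry in report.get("invoices", []):
        data = entry.get("invoice_data") or {}
        amount = data.get("amount")
        if not entry.get("success") or amount is None:
            continue
        key = (data.get("vendor", "unknown"), data.get("currency", "unknown"))
        totals[key] += float(amount)
    return dict(totals)
```

Grouping by currency as well as vendor avoids adding EUR and USD amounts together.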

Configuration

Email Filtering

Edit FILTER_KEYWORDS in fetch_emails.py to customize which emails are processed:

FILTER_KEYWORDS = [
    "invoice",
    "rechnung",
    "receipt",
    "quittung",
    # Add your own keywords...
]
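
How the keywords are matched is not shown here. One plausible reading of "in subject or body" is a case-insensitive substring test, sketched below. The helper `matches_keywords` is hypothetical and may differ from what fetch_emails.py actually does.

```python
FILTER_KEYWORDS = ["invoice", "rechnung", "receipt", "quittung"]


def matches_keywords(subject: str, body: str, keywords=FILTER_KEYWORDS) -> bool:
    """Return True if any keyword occurs in the subject or body, ignoring case."""
    haystack = f"{subject}\n{body}".casefold()
    return any(keyword.casefold() in haystack for keyword in keywords)
```

With substring matching, "invoice" also matches "invoices" and "Rechnung" matches "Rechnungsbeleg", so short keywords can produce false positives.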

Date Range Filtering

To process only invoices from a specific date range, edit extract_invoices_gemini.py:

config = ExtractionConfig(
    filter_date_start="2025-10-01 00:00:00",  # Only process invoices modified after this date
    # ...
)
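
The comment above suggests the filter compares against each file's modification time. A sketch of that idea, assuming the timestamp is in local time and uses the format shown (the function `pdfs_modified_after` is hypothetical):

```python
from datetime import datetime
from pathlib import Path


def pdfs_modified_after(directory: str, filter_date_start: str) -> list:
    """Return PDF paths in `directory` modified after the given local timestamp."""
    cutoff = datetime.strptime(filter_date_start, "%Y-%m-%d %H:%M:%S").timestamp()
    return sorted(
        str(path)
        for path in Path(directory).glob("*.pdf")
        if path.stat().st_mtime > cutoff
    )
```

Note that a file's modification time reflects when the attachment was written to disk, which is not necessarily the invoice date printed on the document.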

Parallel Processing

Adjust the number of parallel workers in extract_invoices_gemini.py:

config = ExtractionConfig(
    max_workers=50,  # Increase for faster processing (depends on API limits)
    # ...
)

Note: The free-tier Gemini API is rate limited (~15 requests/minute), so a high max_workers value mainly helps on the Pro tier, which can handle much higher throughput.
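
To show what max_workers controls, here is a minimal thread-pool sketch built on `concurrent.futures.ThreadPoolExecutor`. The function `process_in_parallel` and its `extract_one` callback are illustrative names, not the repository's API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(paths, extract_one, max_workers=50):
    """Run `extract_one(path)` for every path on a thread pool.

    Returns a dict mapping each path to its result, or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # One failure must not stop the batch.
                results[path] = exc
    return results
```

Threads suit this workload because each task spends most of its time waiting on the network. The pool size only caps concurrency, so it does not by itself keep the request rate under an API quota.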

Architecture

The codebase follows SOLID principles and is modular:

  • fetch_emails.py: Gmail IMAP client for fetching emails and extracting attachments
  • extract_invoices_gemini.py: Modular invoice extraction system with:
    • ExtractionConfig: Configuration dataclass (settings, API keys, retry logic)
    • ExtractionResult: Result dataclass (success/failure, invoice data, metadata)
    • ConfigManager: Handles API keys and Gemini initialization
    • InvoiceFinder: Finds and filters invoice files, manages resume capability
    • InvoiceExtractor: Converts PDFs to images and calls Gemini API with retry logic
    • ExtractionProcessor: Orchestrates parallel processing with ThreadPoolExecutor
    • ResultManager: Handles result persistence (loading/saving JSON)
    • InvoiceExtractionService: Main orchestrator tying all components together
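
For orientation, the two dataclasses might look roughly like the sketch below. Only the fields that appear elsewhere in this README are included. The defaults are assumptions, and the real classes hold more (API keys and retry settings, for example).

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractionConfig:
    # Both field names appear in this README; the default values are assumed.
    max_workers: int = 50
    filter_date_start: Optional[str] = None


@dataclass
class ExtractionResult:
    # Field names mirror keys of one entry in invoices_extracted.json.
    success: bool
    filename: str
    file_path: str
    invoice_data: Optional[dict] = None


def result_to_json_entry(result: ExtractionResult) -> dict:
    """Convert a result into a plain dict that json.dump can serialize."""
    return asdict(result)
```

Passing one `ExtractionConfig` instance into each component's constructor is what the Dependency Injection point refers to: components read their settings from the object they are given instead of from globals.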

Class Responsibilities

  • Single Responsibility: Each class has one clear purpose
  • Dependency Injection: Config passed to all classes
  • Separation of Concerns: Email fetching separate from invoice extraction
  • Error Isolation: Errors handled at appropriate levels

Testing

The project includes a comprehensive test suite with 37 test cases:

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=extract_invoices_gemini --cov=fetch_emails --cov-report=html

# Run specific test file
pytest tests/test_config.py -v

Test Coverage:

  • ✅ Unit tests for all classes and methods
  • ✅ Integration tests for complete workflow
  • ✅ Error handling and edge cases
  • ✅ Mocking of external dependencies (Gemini API, IMAP, file system)

See tests/README.md for detailed test documentation.

Error Handling

The system includes robust error handling:

  • Retry Logic: Automatically retries failed API calls (timeouts, 503, 504 errors)
  • Resume Capability: Skips already-processed invoices, retries failed ones
  • JSON Parsing: Handles malformed JSON responses from Gemini
  • Comprehensive Logging: Detailed error messages and progress tracking
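
The retry behaviour can be pictured with a small wrapper like the one below. The README only says that timeouts and 503/504 errors are retried; the exponential backoff, the attempt count, and the `TransientError` class are assumptions made for this sketch.

```python
import time


class TransientError(Exception):
    """Stand-in for a timeout or a 503/504 response (hypothetical)."""


def call_with_retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on TransientError wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * (2 ** attempt))
```

Only errors classified as transient are retried. Anything else, such as an invalid API key, propagates immediately, since repeating the call would not help.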

Performance

  • Email Fetching: Processes ~100-200 emails/second
  • Invoice Extraction:
    • Free tier: ~15 invoices/minute (rate limited)
    • Pro tier: ~50+ invoices/minute (parallel processing)
  • Success Rate: ~99.9% of extractions succeed

Troubleshooting

Gmail Authentication Issues

If you get AUTHENTICATE failed or LOGIN failed:

  1. Enable 2-Step Verification: Required for App Passwords
  2. Generate App Password: Use the 16-character App Password, not your regular password
  3. Check Workspace Settings: Some Google Workspace accounts disable App Passwords (admin must enable)

Gemini API Errors

  • Rate Limits: Reduce max_workers if you hit rate limits
  • Timeout Errors: The system automatically retries timeouts
  • Invalid API Key: Check your .env.local file has the correct GEMINI_API_KEY

PDF Processing Issues

  • Missing poppler: Install poppler (brew install poppler on macOS, poppler-utils on Ubuntu/Debian); pdf2image requires it
  • Corrupted PDFs: Some PDFs may fail extraction - check invoices_extracted.json for error details
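
Both checks can be scripted. The sketch below tests whether poppler's command-line tools are on PATH (pdf2image calls `pdftoppm` and `pdfinfo`) and lists the entries in the results file whose `success` flag is false. The function names are hypothetical.

```python
import json
import shutil


def poppler_available() -> bool:
    """Return True if the poppler tools used by pdf2image are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


def failed_entries(report: dict) -> list:
    """Return the invoice entries whose `success` flag is not true."""
    return [entry for entry in report.get("invoices", []) if not entry.get("success")]


def load_failed(path: str = "invoices_extracted.json") -> list:
    """Load the results file and return only the failed entries."""
    with open(path, encoding="utf-8") as handle:
        return failed_entries(json.load(handle))
```

Printing each entry returned by `load_failed()` shows whatever error details the extractor recorded for that file.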

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass: pytest tests/ -v
  5. Submit a pull request

See tests/README.md for testing guidelines.

Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing issues for solutions

Acknowledgments

  • Uses Google Gemini AI for invoice extraction
  • Built with Python's standard library and open-source packages
