A production-grade Python tool that fetches invoices and receipts from Gmail and extracts structured data from them using Google's Gemini AI.
- Gmail Integration: Fetches emails from any Gmail inbox using IMAP
- Smart Filtering: Filters emails by keywords (invoice, receipt, rechnung, quittung, etc.) in subject or body
- Attachment Extraction: Automatically extracts all attachments from matching emails
- AI-Powered Extraction: Uses Gemini 2.0 Flash to extract structured invoice data from PDFs
- Parallel Processing: Processes invoices in parallel for fast extraction
- Resume Capability: Automatically resumes from where it left off
- Robust Error Handling: Retry logic, comprehensive error handling, and detailed logging
- Structured Output: Exports all extracted data to JSON with full metadata
- Comprehensive Tests: Full test suite with 37 test cases covering all functionality
- Python 3.8+
- Gmail account with 2-Step Verification enabled
- Google App Password (for Gmail access)
- Gemini API key (for invoice extraction)
# Clone the repository
git clone https://github.com/federicodeponte/gmail-invoice-extractor.git
cd gmail-invoice-extractor
# Install dependencies
pip install -r requirements.txt
# Install system dependencies for PDF processing (macOS)
brew install poppler
# Or on Ubuntu/Debian
sudo apt-get install poppler-utils
Create a .env.local file (or copy from .env.example):
# Gmail credentials
GMAIL_EMAIL=your.email@gmail.com
GMAIL_APP_PASSWORD=your_16_char_app_password
# Gemini API key
GEMINI_API_KEY=your_gemini_api_key
Getting a Gmail App Password:
- Go to https://myaccount.google.com/apppasswords
- Generate a new app password for "Mail"
- Copy the 16-character password
Getting a Gemini API Key:
- Go to https://makersuite.google.com/app/apikey
- Create a new API key
- Copy the key
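The .env.local file described above holds plain KEY=VALUE lines, so it can be checked before running either script. The following is a minimal standard-library sketch, not the project's own loader (the README does not say how the file is read); `load_env_file` and `missing_settings` are hypothetical helper names.

```python
from pathlib import Path


def load_env_file(path: str = ".env.local") -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values: dict) -> list:
    """Return the names of required settings that are absent or empty."""
    required = ("GMAIL_EMAIL", "GMAIL_APP_PASSWORD", "GEMINI_API_KEY")
    return [name for name in required if not values.get(name)]
```

Calling `missing_settings(load_env_file())` before a long run gives an early, readable failure instead of an authentication error halfway through.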
python3 fetch_emails.py
This will:
- Connect to your Gmail inbox
- Search for emails containing invoice/receipt keywords
- Extract all attachments to the email_attachments/ directory
- Show progress and summary
Filter Keywords (configurable in fetch_emails.py):
- English: invoice, receipt, bill
- German: rechnung, quittung, beleg, zahlungsbeleg, gutschrift, abrechnung, rechnungsbeleg
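The filtering and attachment steps above can be sketched with the standard `email` package. This is an illustration under assumptions, not the code in fetch_emails.py: `matches_keywords` and `extract_attachments` are hypothetical names, and case-insensitive substring matching is assumed.

```python
from email.message import EmailMessage

# A subset of the keywords listed above.
FILTER_KEYWORDS = ["invoice", "receipt", "bill", "rechnung", "quittung", "beleg"]


def matches_keywords(subject: str, body: str, keywords=FILTER_KEYWORDS) -> bool:
    """True if any keyword occurs in the subject or body (case-insensitive)."""
    text = f"{subject}\n{body}".lower()
    return any(keyword in text for keyword in keywords)


def extract_attachments(message: EmailMessage) -> list:
    """Return (filename, payload_bytes) for every attachment with a filename.

    Raw bytes fetched over IMAP would first need parsing with
    email.message_from_bytes(raw, policy=email.policy.default).
    """
    attachments = []
    for part in message.iter_attachments():
        filename = part.get_filename()
        if filename:
            attachments.append((filename, part.get_payload(decode=True)))
    return attachments
```

With substring matching, "beleg" already covers "zahlungsbeleg" and "rechnungsbeleg", and "bill" also matches "billing", so short keywords widen the filter more than they may appear to.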
python3 extract_invoices_gemini.py
This will:
- Process all PDF files in email_attachments/
- Extract structured invoice data using Gemini AI
- Save results to invoices_extracted.json
- Show real-time progress
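Because results are saved to invoices_extracted.json, a rerun can skip files that already succeeded. The sketch below shows one way that selection might work; it assumes the `filename` and `success` fields from the output format and uses hypothetical helper names, so the real InvoiceFinder may differ.

```python
import json
from pathlib import Path


def load_processed(results_path: str) -> set:
    """Filenames of invoices that were already extracted successfully."""
    path = Path(results_path)
    if not path.exists():
        return set()  # first run: nothing processed yet
    data = json.loads(path.read_text(encoding="utf-8"))
    return {inv["filename"] for inv in data.get("invoices", []) if inv.get("success")}


def find_pending_pdfs(attachments_dir: str, results_path: str) -> list:
    """PDFs still to process: new files plus earlier failures."""
    done = load_processed(results_path)
    # Note: "*.pdf" is case-sensitive on most Linux filesystems.
    pdfs = sorted(Path(attachments_dir).glob("*.pdf"))
    return [pdf for pdf in pdfs if pdf.name not in done]
```

Only successful entries count as done, so failed invoices are picked up again on the next run.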
Output Format (invoices_extracted.json):
{
  "extraction_summary": {
    "total_invoices": 887,
    "successful_extractions": 886,
    "failed_extractions": 1,
    "extraction_date": "2025-11-07T14:30:00",
    "gemini_model": "gemini-2.0-flash",
    "processing_time_seconds": 1234.56,
    "parallel_workers": 50
  },
  "invoices": [
    {
      "success": true,
      "filename": "invoice_123.pdf",
      "file_path": "email_attachments/invoice_123.pdf",
      "invoice_data": {
        "vendor": "Company Name",
        "amount": 123.45,
        "currency": "EUR",
        "date": "2025-11-01",
        "invoice_number": "INV-2025-001",
        "description": "Services rendered",
        "tax_amount": 19.45,
        "tax_rate": 19.0,
        "items": [
          {
            "name": "Service Item",
            "quantity": 1,
            "price": 123.45,
            "total": 123.45
          }
        ]
      },
      "extraction_method": "gemini-2.0-flash",
      "extraction_timestamp": "2025-11-07T14:30:00",
      "extraction_time": 5.23,
      "image_conversion_time": 1.45,
      "gemini_api_time": 3.78,
      "file_size_bytes": 45678
    }
  ]
}
Edit FILTER_KEYWORDS in fetch_emails.py to customize which emails are processed:
FILTER_KEYWORDS = [
    "invoice",
    "rechnung",
    "receipt",
    "quittung",
    # Add your own keywords...
]
To process only invoices from a specific date range, edit extract_invoices_gemini.py:
config = ExtractionConfig(
    filter_date_start="2025-10-01 00:00:00",  # Only process invoices modified after this date
    # ...
)
Adjust the number of parallel workers in extract_invoices_gemini.py:
config = ExtractionConfig(
    max_workers=50,  # Increase for faster processing (depends on API limits)
    # ...
)
Note: The free-tier Gemini API has rate limits (~15 requests/minute). The Pro tier can handle much higher throughput.
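To make the interaction between `max_workers` and the rate limit concrete, here is a small sketch of a thread-pool fan-out plus the rate-limit arithmetic. It is not the project's ExtractionProcessor: `process_in_parallel` and `minutes_at_rate_limit` are hypothetical names, `extract_one` stands in for the per-file Gemini call, and one API request per invoice is assumed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(paths, extract_one, max_workers=50):
    """Run extract_one(path) across a thread pool and collect one result per path."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = {"success": True, "invoice_data": future.result()}
            except Exception as exc:  # one bad file should not stop the batch
                results[path] = {"success": False, "error": str(exc)}
    return results


def minutes_at_rate_limit(invoice_count, requests_per_minute=15):
    """Lower bound on wall-clock minutes, assuming one request per invoice."""
    return invoice_count / requests_per_minute
```

Under that assumption, 887 invoices at ~15 requests/minute need about 59 minutes however many workers are configured, so on the free tier a high `max_workers` mostly produces rate-limit errors rather than speed.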
The codebase follows SOLID principles and is modular:
- fetch_emails.py: Gmail IMAP client for fetching emails and extracting attachments
- extract_invoices_gemini.py: Modular invoice extraction system with:
  - ExtractionConfig: Configuration dataclass (settings, API keys, retry logic)
  - ExtractionResult: Result dataclass (success/failure, invoice data, metadata)
  - ConfigManager: Handles API keys and Gemini initialization
  - InvoiceFinder: Finds and filters invoice files, manages resume capability
  - InvoiceExtractor: Converts PDFs to images and calls Gemini API with retry logic
  - ExtractionProcessor: Orchestrates parallel processing with ThreadPoolExecutor
  - ResultManager: Handles result persistence (loading/saving JSON)
  - InvoiceExtractionService: Main orchestrator tying all components together
- Single Responsibility: Each class has one clear purpose
- Dependency Injection: Config passed to all classes
- Separation of Concerns: Email fetching separate from invoice extraction
- Error Isolation: Errors handled at appropriate levels
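The principles above can be illustrated with a stripped-down version of the config and result dataclasses. This is a sketch only: the class names and the `max_workers` and `filter_date_start` fields come from this README, while `max_retries`, `output_path`, `error`, and the `call_model` parameter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionConfig:
    max_workers: int = 50
    filter_date_start: Optional[str] = None
    max_retries: int = 3                          # assumed field name
    output_path: str = "invoices_extracted.json"  # assumed field name


@dataclass
class ExtractionResult:
    success: bool
    filename: str
    invoice_data: Optional[dict] = None
    error: Optional[str] = None                   # assumed field name


class InvoiceExtractor:
    """Receives its config and model call through the constructor (dependency injection)."""

    def __init__(self, config: ExtractionConfig, call_model):
        self.config = config
        self.call_model = call_model  # injected callable; a stand-in for the Gemini call

    def extract(self, filename: str) -> ExtractionResult:
        # Error isolation: a failure becomes a result object, not a crash.
        try:
            return ExtractionResult(True, filename, invoice_data=self.call_model(filename))
        except Exception as exc:
            return ExtractionResult(False, filename, error=str(exc))
```

Passing the model call in as a parameter is what lets tests swap the Gemini API for a stub.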
The project includes a comprehensive test suite with 37 test cases:
# Run all tests
pytest tests/ -v
# Run with coverage report
pytest tests/ --cov=extract_invoices_gemini --cov=fetch_emails --cov-report=html
# Run specific test file
pytest tests/test_config.py -v
Test Coverage:
- Unit tests for all classes and methods
- Integration tests for complete workflow
- Error handling and edge cases
- Mocking of external dependencies (Gemini API, IMAP, file system)
See tests/README.md for detailed test documentation.
The system includes robust error handling:
- Retry Logic: Automatically retries failed API calls (timeouts, 503, 504 errors)
- Resume Capability: Skips already-processed invoices, retries failed ones
- JSON Parsing: Handles malformed JSON responses from Gemini
- Comprehensive Logging: Detailed error messages and progress tracking
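A rough sketch of the retry and JSON-recovery behaviour listed above follows. The exponential backoff, the way transient errors are recognised from the message text, and both function names are assumptions; the README does not specify the retry policy or how malformed replies are repaired.

```python
import json
import time


def call_with_retry(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a transient error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            message = str(exc).lower()
            transient = isinstance(exc, TimeoutError) or any(
                marker in message for marker in ("503", "504", "timeout", "timed out")
            )
            if not transient or attempt == max_retries:
                raise  # permanent error, or retries exhausted
            sleep(base_delay * 2 ** attempt)


def parse_model_json(text):
    """Parse a model reply, falling back to the outermost {...} span if it has extra text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])
```

Errors such as an invalid API key are re-raised immediately, since retrying them only burns quota.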
- Email Fetching: Processes ~100-200 emails/second
- Invoice Extraction:
  - Free tier: ~15 invoices/minute (rate limited)
  - Pro tier: ~50+ invoices/minute (parallel processing)
- Success Rate: ~99.9% of extractions succeed
If you get AUTHENTICATE failed or LOGIN failed:
- Enable 2-Step Verification: Required for App Passwords
- Generate App Password: Use the 16-character App Password, not your regular password
- Check Workspace Settings: Some Google Workspace accounts disable App Passwords (admin must enable)
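To separate credential problems from everything else, the login can be tested on its own. The sketch below uses the standard `imaplib` module against `imap.gmail.com` on port 993; `check_gmail_login` is a hypothetical helper, and its `connect` parameter exists only so the check can be exercised without a network.

```python
import imaplib


def check_gmail_login(email_address, app_password, connect=imaplib.IMAP4_SSL):
    """Return (ok, detail) after one IMAP login attempt."""
    # Google displays App Passwords in four groups; drop the spaces.
    password = app_password.replace(" ", "")
    if len(password) != 16:
        return False, "App Password should be 16 characters"
    try:
        client = connect("imap.gmail.com", 993)
        client.login(email_address, password)
        client.logout()
        return True, "login succeeded"
    except imaplib.IMAP4.error as exc:
        return False, f"login rejected: {exc}"
```

Network failures (DNS, firewall) surface as `OSError` and are deliberately not caught here, so they are not mistaken for bad credentials.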
- Rate Limits: Reduce max_workers if you hit rate limits
- Timeout Errors: The system automatically retries timeouts
- Invalid API Key: Check your .env.local file has the correct GEMINI_API_KEY
- Missing poppler: Install poppler-utils (required for pdf2image)
- Corrupted PDFs: Some PDFs may fail extraction - check invoices_extracted.json for error details
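Checking invoices_extracted.json for failures can be scripted. This sketch relies on the `success` and `filename` fields shown in the output format; the `error` key is an assumption, since the sample output only shows a successful entry.

```python
import json


def failed_extractions(results_path="invoices_extracted.json"):
    """Return (filename, error) pairs for every entry whose success flag is false."""
    with open(results_path, encoding="utf-8") as handle:
        data = json.load(handle)
    return [
        # "error" is an assumed key name; fall back to a placeholder if absent.
        (inv.get("filename"), inv.get("error", "no error field recorded"))
        for inv in data.get("invoices", [])
        if not inv.get("success")
    ]
```

Printing the returned pairs gives a quick list of PDFs to inspect by hand or rerun.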
MIT License - see LICENSE file for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: pytest tests/ -v
- Submit a pull request
See tests/README.md for testing guidelines.
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Uses Google Gemini AI for invoice extraction
- Built with Python's standard library and open-source packages