loukaspastras/AI-Data-Extractor

AI Data Extractor

Overview

AI Data Extractor is a modular automation platform that helps organizations extract, validate, and manage customer and invoice data from emails, web forms, and PDF invoices. It combines rule-based and AI-powered (LLM) extraction with human-in-the-loop review and export to Google Sheets. The design goals are scalability, security, and ease of use.

Features

  • Upload and manage files (emails, forms, invoices) via a web dashboard
  • Automated extraction of key fields using hybrid rule-based and LLM methods
  • Manual review, editing, and approval of extracted data
  • Error and warning detection with clear user feedback
  • Export confirmed data to Google Sheets
  • Full audit trail and file lifecycle management
  • Extensible architecture for future integrations

Architecture

  • Frontend: React web application for user interaction, file management, and data review
  • Backend: Python REST API for business logic, extraction workflows, and data management
  • Database: Stores file metadata, extracted data, and audit logs securely
  • AI Extraction Engine: Supports any major LLM provider and custom rule-based logic
  • External Integrations: Google Sheets API for data export
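To illustrate how rule-based logic and a pluggable LLM provider might fit together, here is a minimal sketch. It is not taken from the repository: the `RULES` patterns, the `FieldResult` type, the `extract_fields` function, and the fixed confidence values are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class FieldResult:
    value: Optional[str]
    confidence: float  # 0.0 to 1.0; the values used below are placeholders
    source: str        # "rule", "llm", or "none"


# Hypothetical rule set: one regex per field, first capture group is the value.
RULES: Dict[str, "re.Pattern[str]"] = {
    "email": re.compile(r"([\w.+-]+@[\w-]+\.[\w.-]+)"),
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
}

# Any LLM provider can be wrapped as a callable: (field_name, text) -> value or None.
LLMExtractor = Callable[[str, str], Optional[str]]


def extract_fields(text: str, llm: Optional[LLMExtractor] = None) -> Dict[str, FieldResult]:
    """Try the rule for each field first; fall back to the LLM if the rule misses."""
    results: Dict[str, FieldResult] = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            results[name] = FieldResult(match.group(1), 0.95, "rule")
            continue
        value = llm(name, text) if llm is not None else None
        if value is not None:
            results[name] = FieldResult(value, 0.6, "llm")
        else:
            results[name] = FieldResult(None, 0.0, "none")
    return results
```

Keeping the LLM behind a plain callable means the provider can be swapped without touching the rule logic, and fields with `source == "none"` or a low confidence can be flagged for manual review.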

How It Works

  1. File Upload: Users upload files (emails, forms, invoices) through the dashboard. Metadata and contents are stored securely.
  2. Extraction: The backend processes each file, extracting relevant fields using rule-based logic and LLMs. Confidence scores and warnings are generated for each field.
  3. Review & Approval: Users review extracted data in interactive tables, edit fields as needed, and approve entries. Human-in-the-loop controls ensure only validated data is finalized.
  4. State Management: Each file moves through three states—unprocessed, waiting (for review), and complete (approved). State transitions are tracked and visible in the dashboard.
  5. Export: Approved data can be exported to Google Sheets for reporting and further analysis.
  6. Audit & Compliance: All actions are logged for traceability and compliance. The system is designed to support GDPR and other regulatory requirements.
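The three file states described above can be modeled as a small state machine. The sketch below is illustrative only: the `FileState` enum, the `ALLOWED` transition table, and the `advance` function are hypothetical names, and the assumption that files only move forward (unprocessed to waiting to complete) is ours rather than something the project states.

```python
from enum import Enum
from typing import List, Tuple


class FileState(str, Enum):
    UNPROCESSED = "unprocessed"
    WAITING = "waiting"      # extracted, awaiting human review
    COMPLETE = "complete"    # approved by a reviewer


# Assumed forward-only transitions.
ALLOWED = {
    FileState.UNPROCESSED: {FileState.WAITING},
    FileState.WAITING: {FileState.COMPLETE},
    FileState.COMPLETE: set(),
}

# In-memory stand-in for an audit log: (file_id, from_state, to_state).
audit_log: List[Tuple[str, str, str]] = []


def advance(file_id: str, current: FileState, target: FileState) -> FileState:
    """Validate a state transition and record it in the audit log."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    audit_log.append((file_id, current.value, target.value))
    return target
```

Rejecting illegal transitions in one place makes it harder for a file to reach `complete` without passing through review.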

Getting Started

Prerequisites

  • Python 3.8+
  • Node.js 16+
  • Google Cloud project (for Sheets API)

Backend Setup

  1. Clone the repository:

    git clone <repo-url>
    cd AI-Data-Extractor
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Initialize the database:

    cd src/backend/database
    python db.py
  4. Start the backend server (from the repository root):

    cd src/backend
    uvicorn api.main:app --reload

Frontend Setup

  1. Navigate to the frontend directory:
    cd src/frontend
  2. Install Node.js dependencies:
    npm install
  3. Start the frontend development server:
    npm start

API Key Setup

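To enable AI-powered extraction, set the ANTHROPIC_API_KEY environment variable to a valid API key from Anthropic (or your chosen LLM provider). In Windows Command Prompt:

set ANTHROPIC_API_KEY=your_actual_api_key_here

On macOS or Linux, use export instead:

export ANTHROPIC_API_KEY=your_actual_api_key_here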
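A backend can check for the key at startup and fail with a clear message instead of erroring on the first extraction request. The helper below is a hypothetical sketch, not code from the repository; only the variable name ANTHROPIC_API_KEY comes from this README.

```python
import os


def require_api_key(name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing or blank."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; AI-powered extraction is unavailable."
        )
    return key
```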

Usage

  • Upload files via the dashboard
  • Review, edit, and approve extracted data
  • Export confirmed data to Google Sheets
  • Monitor file status and audit logs
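As a rough illustration of the export step, approved records can be flattened into a header row plus one row per record, which is the list-of-lists shape the Google Sheets API accepts as `values`. The `to_sheet_rows` function, the `status` key, and the column names in the usage note are assumptions for this sketch, not the project's actual schema.

```python
from typing import Dict, List, Sequence


def to_sheet_rows(
    records: Sequence[Dict[str, object]], columns: Sequence[str]
) -> List[List[str]]:
    """Build a header row plus one row per approved record.

    Assumes each record carries a hypothetical "status" key and that only
    records with status "complete" should be exported.
    """
    rows: List[List[str]] = [list(columns)]
    for record in records:
        if record.get("status") != "complete":
            continue  # skip anything not yet approved
        rows.append(
            ["" if record.get(col) is None else str(record[col]) for col in columns]
        )
    return rows
```

For example, `to_sheet_rows(records, ["customer", "total"])` yields rows that could be sent as `{"values": rows}` in a Sheets append request.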

Extending the System

  • Add new extraction models or document types by updating backend extraction logic
  • Integrate with other platforms (Excel, ERP, CRM) via modular API endpoints
  • Enhance security and compliance features as needed

Support & Documentation

  • See buisness-documents/technical-description/architecture.md for a full technical overview
  • For troubleshooting, consult the logs and error messages in the dashboard
  • Contact the maintainers for support or feature requests

API Documentation (Swagger)

After starting the backend server, you can access the interactive API documentation at:

http://localhost:8000/docs

This provides a full overview of all available endpoints and allows you to test API calls directly from your browser.
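The same information is available programmatically: FastAPI serves the OpenAPI schema at /openapi.json by default, alongside the /docs page. The sketch below lists the method and path of each endpoint from that schema. The function names are hypothetical, and it assumes the backend keeps FastAPI's default schema URL.

```python
import json
from typing import List
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema: dict) -> List[str]:
    """Return 'METHOD /path' strings from an OpenAPI schema dictionary."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Path items can also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return endpoints


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI schema from a running backend."""
    with urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)
```

With the backend running, `print("\n".join(list_endpoints(fetch_schema())))` prints one line per endpoint.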

License

MIT License

AI Data Extractor

This project automates extraction and management of data from emails, invoices, and contact forms using Python (FastAPI, SQLAlchemy, LangChain) and React.

Features

  • Upload, view, delete, and extract data from files
  • LLM and rule-based extraction workflows
  • Dashboard and file management UI
  • Robust error handling and workflow automation

Setup

See the code and comments for backend and frontend setup instructions.
