A comprehensive Retrieval-Augmented Generation (RAG) system that can process PDFs, images, and scanned documents to extract and retrieve information from tables, charts, and mixed text-image content.
- PDF Processing: Extract text, tables, and images from PDF documents
- Image Processing: OCR for scanned documents and image-based content
- Mixed Content: Handle documents with text, tables, charts, and images
- Table Extraction: Automatically detect and extract data from tables
- Chart Recognition: Identify and analyze charts and graphs
- OCR Integration: High-accuracy text extraction from scanned documents
- Visual Element Recognition: Index and search visual elements
- Vector Database: FAISS for efficient similarity search
- Embedding Models: Sentence Transformers for semantic understanding
- Chunking Strategies: Overlapping document chunks sized for retrieval (default: 1000 characters with a 200-character overlap)
- Context-Aware Responses: Generate relevant answers based on retrieved context
- Document Processor: Handles PDF, image, and scanned document ingestion
- OCR Engine: Tesseract + EasyOCR for text extraction
- Table Extractor: PDFPlumber + OpenCV for table detection
- Vector Database: FAISS for storing and retrieving embeddings
- RAG Pipeline: LangChain for retrieval-augmented generation
- Web Interface: Streamlit for user-friendly interaction
- Python 3.8+
- Streamlit: Web application framework
- LangChain: RAG pipeline orchestration
- FAISS: Vector database
- Sentence Transformers: Embedding generation
- OpenCV: Image processing
- Tesseract/EasyOCR: OCR capabilities
- PDFPlumber: PDF text and table extraction
- Clone the repository
git clone <repository-url>
cd visual-document-rag
- Install dependencies
pip install -r requirements.txt
- Install Tesseract OCR
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki
- Set up environment variables
cp .env.example .env
# Edit .env with your OpenAI API key
- Fork this repository
- Go to share.streamlit.io
- Connect your GitHub account
- Select this repository
- Deploy!
- Upload Documents: Upload PDFs, images, or scanned documents
- Process Documents: The system automatically extracts text, tables, and visual elements
- Ask Questions: Query the system about the uploaded documents
- Get Answers: Receive context-aware responses with source citations
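The last two steps depend on the retrieved chunks carrying enough metadata to be cited. The sketch below shows one way a prompt with numbered, citable context could be assembled. `RetrievedChunk` and `build_prompt` are hypothetical names used for illustration, not functions from this repository, and the prompt wording is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedChunk:
    """A retrieved passage plus the metadata needed to cite it (hypothetical type)."""
    text: str
    source: str  # file name of the originating document
    page: int    # page number within that document


def build_prompt(question: str, chunks: List[RetrievedChunk]) -> str:
    """Number each chunk so the model can cite its sources as [1], [2], ..."""
    context_lines = [
        f"[{i}] ({chunk.source}, page {chunk.page}) {chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered context below. "
        "Cite the supporting passages by number, e.g. [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    example = [RetrievedChunk("Revenue grew 12% year over year.", "report.pdf", 3)]
    print(build_prompt("How did revenue change?", example))
```

Numbering the passages keeps the citation format independent of file names, so the application can map `[1]` back to a document and page when it displays the answer.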
The system evaluates performance using:
- Retrieval Accuracy: Precision and recall of relevant document chunks
- Response Relevance: Quality of generated answers
- Processing Speed: Document processing and query response times
- OCR Accuracy: Text extraction quality from scanned documents
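As an illustration of the retrieval accuracy metric, the sketch below computes precision and recall for one query, given the IDs of the retrieved chunks and a hand-labeled set of relevant chunk IDs. The function name and the use of string chunk IDs are assumptions. The project's own evaluation code may differ.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant chunks retrieved / all chunks retrieved
    recall    = relevant chunks retrieved / all relevant chunks
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Five retrieved chunks, of which two are among the three relevant ones.
    p, r = precision_recall(["c1", "c2", "c3", "c4", "c5"], ["c2", "c5", "c9"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these values over a set of labeled queries gives a single figure for comparing chunking or embedding settings.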
visual-document-rag/
├── app.py # Main Streamlit application
├── document_processor.py # Document processing pipeline
├── ocr_engine.py # OCR and text extraction
├── table_extractor.py # Table detection and extraction
├── vector_store.py # FAISS integration
├── rag_pipeline.py # RAG query processing
├── utils.py # Utility functions
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── .env.example # Environment variables template
- OPENAI_API_KEY: Your OpenAI API key for text generation
- FAISS_PERSIST_DIRECTORY: Directory for FAISS persistence
- MODEL_NAME: Sentence transformer model name
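A minimal sketch of how these variables could be read at startup, assuming the values from `.env` have already been exported into the process environment. `Settings` and `load_settings` are hypothetical names, and both default values are assumptions chosen for illustration, not defaults documented by this project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    faiss_persist_directory: str
    model_name: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from `env` (defaults to os.environ) and fail early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; copy .env.example to .env and fill it in"
        )
    return Settings(
        openai_api_key=api_key,
        # Assumed default directory, not taken from the project.
        faiss_persist_directory=env.get("FAISS_PERSIST_DIRECTORY", "./faiss_index"),
        # Assumed default: a commonly used Sentence Transformers model.
        model_name=env.get("MODEL_NAME", "all-MiniLM-L6-v2"),
    )
```

Passing a plain dictionary as `env` makes the loader easy to exercise without touching the real environment.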
- Chunk Size: 1000 characters with a 200-character overlap
- Top-k Retrieval: 5 most relevant chunks
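The two settings above can be sketched in plain Python. `chunk_text` slices text into 1000-character windows that share 200 characters with their neighbor, and `top_k` ranks chunk vectors by cosine similarity to a query vector. Both function names are hypothetical. The brute-force `top_k` is only a stand-in to show what top-k retrieval computes; the project itself uses FAISS for similarity search, and its actual chunker may split on different boundaries.

```python
import math
from typing import List, Sequence, Tuple


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows overlapping by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    pieces = chunk_text("x" * 2000)
    print([len(p) for p in pieces])
    print(top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2))
```

The overlap means a sentence cut at a window boundary still appears whole in at least one chunk, provided it is shorter than the overlap, at the cost of storing some text twice.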
- Contract analysis and clause extraction
- Legal document search and retrieval
- Case law document processing
- Medical report analysis
- Patient record processing
- Research paper information extraction
- Financial report analysis
- Invoice and receipt processing
- Regulatory document compliance
- Research paper analysis
- Textbook content extraction
- Academic document processing