PDF Upload & Processing
- PDF Upload (FastAPI /upload endpoint; sketch below)
- File validation and UUID generation
- Save file and store metadata in MongoDB
- Queue background processing task
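
A minimal sketch of what the upload endpoint might look like. The `pdf_rag` database, `files` collection, `uploads` directory, and the `process_pdf` entry point are illustrative assumptions, not the project's actual names:

```python
import uuid
from pathlib import Path

from fastapi import BackgroundTasks, FastAPI, File, HTTPException, UploadFile
from pymongo import MongoClient

app = FastAPI()
files = MongoClient("mongodb://localhost:27017")["pdf_rag"]["files"]  # assumed DB/collection names
UPLOAD_DIR = Path("uploads")
UPLOAD_DIR.mkdir(exist_ok=True)


def process_pdf(file_id: str, path: str) -> None:
    """Hypothetical entry point for the processing pipeline (extraction, OCR, chunking, indexing)."""


@app.post("/upload")
async def upload_pdf(file: UploadFile = File(...), background_tasks: BackgroundTasks = None):
    # Validate the upload before accepting it
    if not file.filename or not file.filename.lower().endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are accepted")

    # Generate a UUID so the file can be tracked through every pipeline stage
    file_id = str(uuid.uuid4())
    dest = UPLOAD_DIR / f"{file_id}.pdf"
    dest.write_bytes(await file.read())

    # Store metadata and an initial status in MongoDB
    files.insert_one({"file_id": file_id, "original_name": file.filename, "status": "processing"})

    # Queue the heavy processing work so the request returns immediately
    background_tasks.add_task(process_pdf, file_id, str(dest))
    return {"file_id": file_id, "status": "processing"}
```
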
- PDF Processing Pipeline (extraction and chunking sketch below)
- Extract text, images, and tables using PyMuPDF, Camelot, PDFPlumber
- Generate markdown with image placeholders
- Run OCR on images using PaddleOCR
- Clean and normalize markdown content
- Process diagrams using LLaVA vision model
- Split content into chunks and create vector documents
- Generate document summary using Ollama LLM
- Store documents in ChromaDB vector store
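
As a rough illustration of the extraction and chunking stages, here is a condensed sketch using PyMuPDF; the placeholder format and the fixed-size chunker are simplifications of whatever the pipeline actually does:

```python
import fitz  # PyMuPDF


def extract_markdown(pdf_path: str) -> tuple[str, list[bytes]]:
    """Pull page text and embedded images, leaving image placeholders in the markdown."""
    doc = fitz.open(pdf_path)
    parts, images = [], []
    for page_index, page in enumerate(doc):
        parts.append(page.get_text("text"))
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            images.append(doc.extract_image(xref)["image"])
            # Placeholder that the OCR / vision stages later replace with extracted text
            parts.append(f"![image_{page_index}_{img_index}](pending)")
    return "\n\n".join(parts), images


def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap; the real pipeline may split on structure instead."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```
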
Vector Storage & Indexing
- Vector Store (ChromaDB; indexing sketch below)
- Persistent storage with document embeddings
- Metadata tracking (file_id, chunk_index, source_file)
- Embeddings generated by nomic-embed-text model
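
The storage step might look roughly like this, assuming a local persistent client and a `pdf_documents` collection name (both assumptions):

```python
import chromadb
import ollama

# Persistent local store; the path and collection name are assumptions
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("pdf_documents")


def index_chunks(file_id: str, source_file: str, chunks: list[str]) -> None:
    """Embed each chunk with nomic-embed-text via Ollama and store it with tracking metadata."""
    for i, chunk in enumerate(chunks):
        embedding = ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
        collection.add(
            ids=[f"{file_id}_{i}"],
            embeddings=[embedding],
            documents=[chunk],
            metadatas=[{"file_id": file_id, "chunk_index": i, "source_file": source_file}],
        )
```
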
- RAG Pipeline (LlamaIndex; setup sketch below)
- Create index from ChromaDB documents
- Configure with Ollama LLM and embeddings
- Set up query engine for similarity search
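
A sketch of how the LlamaIndex side could be wired to the same Chroma collection and the local Ollama models; the path, collection name, and `similarity_top_k` value are illustrative:

```python
import chromadb
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

# Point LlamaIndex at the local Ollama models
Settings.llm = Ollama(model="llama3.2:3b", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Wrap the existing Chroma collection as a LlamaIndex vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("pdf_documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Build an index over the stored embeddings and expose a similarity-search query engine
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=5)
```
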
Question Answering
- Question Processing (FastAPI /ask endpoint; sketch below)
- Receive question and query vector store
- Retrieve top-k similar chunks
- Generate answer using Ollama LLM with retrieved context
- Return structured response with sources and confidence score
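
Building on the `app` and `query_engine` objects from the sketches above, the endpoint might look like this; deriving the confidence value from the mean retrieval score is an assumption:

```python
from pydantic import BaseModel


class AskRequest(BaseModel):
    question: str


@app.post("/ask")
def ask(request: AskRequest):
    # Similarity search over ChromaDB plus answer generation with the Ollama LLM
    response = query_engine.query(request.question)

    # Expose the retrieved chunks and their similarity scores as sources
    sources = [
        {
            "source_file": node.node.metadata.get("source_file"),
            "chunk_index": node.node.metadata.get("chunk_index"),
            "score": node.score,
        }
        for node in response.source_nodes
    ]
    # Assumption: use the mean retrieval score as a rough confidence value
    scores = [s["score"] for s in sources if s["score"] is not None]
    confidence = sum(scores) / len(scores) if scores else 0.0
    return {"answer": str(response), "sources": sources, "confidence": confidence}
```
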
Supporting Services
- Core Services
- Ollama Service: LLM (llama3.2:3b), embeddings (nomic-embed-text), vision (llava:7b); wrapper sketch below
- MongoDB: File metadata and status tracking
- OCR Service: PaddleOCR for image text extraction
- Markdown Cleaner: Content normalization
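
For reference, a thin wrapper around the three Ollama models could look like this; the function names are illustrative, not the project's actual service API:

```python
import ollama

LLM_MODEL = "llama3.2:3b"
EMBED_MODEL = "nomic-embed-text"
VISION_MODEL = "llava:7b"


def generate(prompt: str) -> str:
    """Answer and summary generation with the chat model."""
    reply = ollama.chat(model=LLM_MODEL, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]


def embed(text: str) -> list[float]:
    """Embedding generation for vector storage and retrieval."""
    return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]


def describe_image(image_path: str, prompt: str = "Describe this diagram.") -> str:
    """Vision call used for diagram understanding."""
    reply = ollama.chat(
        model=VISION_MODEL,
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    return reply["message"]["content"]
```
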
Data Flow: PDF -> Text/Image Extraction -> Markdown -> OCR -> Cleaning -> Chunking -> Vector Store -> Question -> Similarity Search -> LLM Response
