A production-ready RAG (Retrieval-Augmented Generation) system for generating and validating technical documentation. The system ingests technical source documents, extracts their structure, and generates new technical documents that follow predefined formats with automated validation and compliance checking.
- Document Parsing: Extract and clean text from PDF files
- Intelligent Chunking: Split documents into manageable chunks with overlap (a minimal sketch follows this list)
- Semantic Search: FAISS-based vector storage and retrieval
- Document Generation: LLM-powered technical document generation using Ollama
- Automated Validation: Check document completeness and structural compliance
- PDF Export: Generate formatted PDF documents and validation reports
- RESTful API: FastAPI-based backend with comprehensive endpoints
- Web Interface: Streamlit-based UI for easy interaction
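To make the chunking-with-overlap feature concrete, here is a minimal sketch of a fixed-size overlapping splitter; the function, sizes, and defaults are illustrative, not the project's actual parsing code:

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters, so content cut at one chunk boundary still
    appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```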
The system follows a modular architecture with clear separation of concerns:
```
DocRAG/
├── src/
│   ├── parse/        # Document parsing and chunking
│   ├── embed/        # Embedding generation and FAISS indexing
│   ├── retrieval/    # RAG retrieval system
│   ├── generation/   # LLM integration and prompt building
│   ├── validation/   # Document validation and compliance checking
│   ├── api/          # FastAPI REST endpoints
│   ├── ui/           # Streamlit web interface
│   └── utils/        # Utility functions (PDF export, etc.)
├── models/           # Model storage
├── docs/             # Documentation and example documents
├── data/             # Data storage (gitignored)
└── tests/            # Unit and integration tests
```
- Document Ingestion: PDF files are parsed and cleaned
- Chunking: Text is split into overlapping chunks
- Embedding: Chunks are converted to vectors using sentence-transformers
- Indexing: Vectors are stored in FAISS for efficient retrieval
- Retrieval: Query-based semantic search retrieves relevant chunks (the embed-index-retrieve core is sketched after this list)
- Generation: LLM generates documents based on retrieved context
- Validation: Generated documents are checked for completeness and compliance
- Export: Documents and reports can be exported as PDF
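The embed-index-retrieve core of this flow is compact enough to sketch with the libraries the project already names; the chunks and query below are illustrative, and only the model name matches the configured default:

```python
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "The component shall operate between -40 and +85 degrees Celsius.",
    "All firmware updates must be cryptographically signed.",
    "The enclosure is rated IP67 for dust and water ingress.",
]

# Embed the chunks (all-MiniLM-L6-v2 returns 384-dim vectors).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_numpy=True)

# Index the vectors in FAISS for nearest-neighbour search.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Retrieve the chunks most relevant to a query.
query = model.encode(["What are the temperature requirements?"],
                     convert_to_numpy=True)
distances, ids = index.search(query, 2)
for i in ids[0]:
    print(chunks[i])
```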
- Python 3.8 or higher
- Ollama installed and configured (for LLM generation)
  - Download from: https://ollama.ai/
  - Install it and ensure it is on your PATH
  - Pull a model (a quick smoke test of the local Ollama API is sketched below):

```bash
ollama pull llama3
```
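Before wiring Ollama into the pipeline, it is worth a quick smoke test against its local HTTP API (Ollama serves on port 11434 by default; the snippet assumes the llama3 model pulled above):

```python
import requests

# Ask the local Ollama server for a one-shot completion.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Say hello in one sentence.",
        "stream": False,  # return one JSON object instead of a stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```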
- Clone the repository:

```bash
git clone https://github.com/I2S9/DocRAG.git
cd DocRAG
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Verify installation:

```bash
python -m src.main
```

Start the API server:

```bash
python start_api.py
# Or: uvicorn src.api.app:app --reload
```

The API is available at http://localhost:8000.

Start the web interface (in a separate terminal):

```bash
python run_ui.py
# Or: streamlit run src/ui/app.py
```

The interface is available at http://localhost:8501.
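With both processes running, a quick way to confirm the backend is up is to call the root endpoint. The status payload shape is not documented here, so the JSON access below is an assumption:

```python
import requests

# Hit the root endpoint, which returns the API status.
resp = requests.get("http://localhost:8000/", timeout=10)
resp.raise_for_status()
print(resp.json())  # assumes a JSON status body, as is typical for FastAPI
```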
- Upload a PDF document through the web interface
- Enter a generation query (e.g., "Generate a technical specification for component X")
- Review the generated document and validation report
- Export documents or reports as PDF if needed
Root endpoint returning API status.
Index a PDF document for retrieval.
Request: multipart form data containing the PDF file.

Response:

```json
{
  "message": "Document 'filename.pdf' indexed successfully",
  "chunks_count": 33
}
```

Generate a technical document based on a query.
Request:

```json
{
  "query": "Generate a technical specification for component X"
}
```

Response:

```json
{
  "document": "Generated document text...",
  "validation": {
    "all_sections_present": true,
    "sections": {
      "Introduction": true,
      "Scope": true,
      "Requirements": true,
      "Constraints": true,
      "Safety considerations": true
    }
  }
}
```

Validate a document's structure and completeness.
Request:

```json
{
  "text": "Document text to validate..."
}
```

Response:

```json
{
  "validation": {
    "all_sections_present": false,
    "sections": {
      "Introduction": true,
      "Scope": true,
      "Requirements": false,
      "Constraints": false,
      "Safety considerations": false
    }
  }
}
```

Export a document or validation report to PDF.
Request:

```json
{
  "text": "Document text to export...",
  "title": "Technical Document",
  "export_type": "document"
}
```

Response: PDF file download.
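Taken together, the endpoints above can be driven from a small script. The route paths are not reproduced in this README, so the paths, multipart field name, and file name below are placeholders; check src/api/app.py for the actual routes before using it:

```python
import requests

BASE = "http://localhost:8000"

# NOTE: /index, /generate, /validate and the "file" field name are
# placeholders -- check src/api/app.py for the routes actually registered.

# 1. Index a PDF for retrieval (multipart upload).
with open("docs/example.pdf", "rb") as pdf:  # placeholder file name
    resp = requests.post(f"{BASE}/index", files={"file": pdf})
resp.raise_for_status()
print(resp.json()["chunks_count"], "chunks indexed")

# 2. Generate a document from a query.
resp = requests.post(
    f"{BASE}/generate",
    json={"query": "Generate a technical specification for component X"},
)
resp.raise_for_status()
result = resp.json()

# 3. Re-validate the generated text via the validation endpoint.
resp = requests.post(f"{BASE}/validate", json={"text": result["document"]})
print(resp.json()["validation"]["sections"])
```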
Run unit tests:

```bash
python -m pytest tests/test_unit_*.py -v
```

Run integration tests (these require PDF files in the docs/ directory):

```bash
python tests/test_parsing.py
python tests/test_validation.py
```

The validation system checks for the following required sections (a minimal presence check is sketched after the list):
- Introduction
- Scope
- Requirements
- Constraints
- Safety considerations
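The validation module itself is not shown here, but the response shape above suggests a per-section presence check. A minimal sketch of that idea follows; the function name and heading-matching rule are assumptions, not the project's actual implementation:

```python
import re

REQUIRED_SECTIONS = [
    "Introduction",
    "Scope",
    "Requirements",
    "Constraints",
    "Safety considerations",
]

def validate_sections(text: str) -> dict:
    """Report which required sections appear as headings in the text."""
    sections = {
        # Accept plain, markdown (#), or numbered (1.) headings at line start.
        name: bool(re.search(rf"^\s*(?:#+\s*|\d+\.?\s*)?{re.escape(name)}\b",
                             text, re.IGNORECASE | re.MULTILINE))
        for name in REQUIRED_SECTIONS
    }
    return {
        "all_sections_present": all(sections.values()),
        "sections": sections,
    }
```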
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2` (default, ~80 MB)
- LLM: Ollama with a compatible model (e.g., `llama3`, `llama3:8b`)
- Minimum 4GB RAM (8GB recommended)
- Python 3.8+
- Ollama installed and configured
- Internet connection for initial model downloads
- Parsing quality depends on the structure of the source PDF; scanned or image-only PDFs extract poorly without OCR
- LLM generation speed depends on hardware and model size
- Validation rules are designed to be configurable but are currently fixed to the section list above
- The FAISS index is held in memory and is not persisted between restarts (a persistence sketch follows this list)
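If losing the index on restart is a problem, FAISS itself can serialize an index to disk. A minimal sketch under assumed paths (note that FAISS stores only the vectors, so the chunk texts must be saved separately, e.g. as JSON):

```python
import faiss
import numpy as np

# Build a small demo index (384 dims matches all-MiniLM-L6-v2 embeddings).
dim = 384
index = faiss.IndexFlatL2(dim)
index.add(np.random.rand(10, dim).astype("float32"))

# Persist the index to disk (the data/ directory from the layout above
# must already exist)...
faiss.write_index(index, "data/faiss.index")

# ...and load it back after a restart.
index = faiss.read_index("data/faiss.index")
print(index.ntotal)  # 10
```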
