DocRAG - Technical Document Generation and Compliance System

A production-ready RAG (Retrieval-Augmented Generation) system for generating and validating technical documentation. The system ingests technical source documents, extracts their structure, and generates new technical documents that follow predefined formats with automated validation and compliance checking.

Features

Document Parsing: Extract and clean text from PDF files
Intelligent Chunking: Split documents into manageable chunks with overlap
Semantic Search: FAISS-based vector storage and retrieval
Document Generation: LLM-powered technical document generation using Ollama
Automated Validation: Check document completeness and structural compliance
PDF Export: Generate formatted PDF documents and validation reports
RESTful API: FastAPI-based backend with comprehensive endpoints
Web Interface: Streamlit-based UI for easy interaction

Architecture

The system follows a modular architecture with clear separation of concerns:

DocRAG/
├── src/
│   ├── parse/          # Document parsing and chunking
│   ├── embed/           # Embedding generation and FAISS indexing
│   ├── retrieval/       # RAG retrieval system
│   ├── generation/      # LLM integration and prompt building
│   ├── validation/      # Document validation and compliance checking
│   ├── api/             # FastAPI REST endpoints
│   ├── ui/              # Streamlit web interface
│   └── utils/           # Utility functions (PDF export, etc.)
├── models/              # Model storage
├── docs/                # Documentation and example documents
├── data/                # Data storage (gitignored)
└── tests/               # Unit and integration tests

Pipeline Flow

1. Document Ingestion: PDF files are parsed and cleaned
2. Chunking: Text is split into overlapping chunks
3. Embedding: Chunks are converted to vectors using sentence-transformers
4. Indexing: Vectors are stored in FAISS for efficient retrieval
5. Retrieval: Query-based semantic search retrieves relevant chunks
6. Generation: LLM generates documents based on retrieved context
7. Validation: Generated documents are checked for completeness and compliance
8. Export: Documents and reports can be exported as PDF
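
The chunking step can be sketched as a sliding character window. This is a minimal illustration, not the code in src/parse/: the function name `chunk_text`, the character-based sizing, and the default `chunk_size` and `overlap` values are all assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper: sizes are in characters, and the defaults are
    illustrative rather than taken from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.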

Installation

Prerequisites

Python 3.8 or higher
Ollama installed and configured (for LLM generation)
- Download from: https://ollama.ai/
- Install and ensure it's in your PATH
- Pull a model: ollama pull llama3

Setup

Clone the repository:

git clone https://github.com/I2S9/DocRAG.git
cd DocRAG

Install dependencies:

pip install -r requirements.txt

Verify installation:

python -m src.main

Usage

Quick Start

Start the API server:

python start_api.py
# Or: uvicorn src.api.app:app --reload

API available at http://localhost:8000

Start the web interface (in a separate terminal):

python run_ui.py
# Or: streamlit run src/ui/app.py

Interface available at http://localhost:8501

Workflow

1. Upload a PDF document through the web interface
2. Enter a generation query (e.g., "Generate a technical specification for component X")
3. Review the generated document and validation report
4. Export documents or reports as PDF if needed

API Endpoints

GET `/`

Root endpoint returning API status.

POST `/index`

Index a PDF document for retrieval.

Request: Multipart form data with PDF file

Response:

{
  "message": "Document 'filename.pdf' indexed successfully",
  "chunks_count": 33
}
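
For scripted uploads, a multipart request can be built with the standard library alone. The sketch below only constructs the request. The form field name `file` is an assumption (the README does not state it), and `build_index_request` is a hypothetical helper, so check the FastAPI route in src/api/ before relying on it.

```python
import urllib.request
import uuid


def build_index_request(pdf_bytes: bytes, filename: str,
                        base_url: str = "http://localhost:8000",
                        field_name: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for /index.

    `field_name` is an assumed form field name, not confirmed by the README.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/index",
        data=head + pdf_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Sending it is a single call, `urllib.request.urlopen(req)`, which requires the API server to be running.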

POST `/generate`

Generate a technical document based on a query.

Request:

{
  "query": "Generate a technical specification for component X"
}

Response:

{
  "document": "Generated document text...",
  "validation": {
    "all_sections_present": true,
    "sections": {
      "Introduction": true,
      "Scope": true,
      "Requirements": true,
      "Constraints": true,
      "Safety considerations": true
    }
  }
}
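
A small client for this endpoint might look like the following sketch. The helper names and the 300-second timeout are assumptions, chosen because local LLM generation can be slow. Only the URL path and the `query` field come from the request format above.

```python
import json
import urllib.request


def build_generate_request(query: str,
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the JSON POST for /generate without sending it."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def generate(query: str, base_url: str = "http://localhost:8000",
             timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_generate_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```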

POST `/validate`

Validate a document's structure and completeness.

Request:

{
  "text": "Document text to validate..."
}

Response:

{
  "validation": {
    "all_sections_present": false,
    "sections": {
      "Introduction": true,
      "Scope": true,
      "Requirements": false,
      "Constraints": false,
      "Safety considerations": false
    }
  }
}
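
When consuming this response in a script, it is often handy to reduce it to the list of missing sections. A minimal sketch, assuming the response shape shown above. `missing_sections` is a hypothetical helper name.

```python
from typing import Dict, List


def missing_sections(response: Dict) -> List[str]:
    """Return the names of sections the validator marked as absent."""
    sections = response.get("validation", {}).get("sections", {})
    return [name for name, present in sections.items() if not present]
```

Applied to the example response above, this yields `Requirements`, `Constraints`, and `Safety considerations`.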

POST `/export`

Export a document or validation report to PDF.

Request:

{
  "text": "Document text to export...",
  "title": "Technical Document",
  "export_type": "document"
}

Response: PDF file download
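
Because the response body is binary, a client should write it to disk as bytes. The sketch below adds a light sanity check based on the `%PDF-` marker that PDF files start with, so that a JSON error body is not silently saved as a `.pdf`. `save_pdf` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def save_pdf(content: bytes, path: str) -> Path:
    """Write response bytes to `path` after checking for the PDF header."""
    if not content.startswith(b"%PDF-"):
        raise ValueError("response body does not look like a PDF")
    target = Path(path)
    target.write_bytes(content)
    return target
```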

Testing

Run unit tests:

python -m pytest tests/test_unit_*.py -v

Run integration tests (requires PDF files in docs/ directory):

python tests/test_parsing.py
python tests/test_validation.py

Requirements

Required Sections for Technical Documents

The validation system checks for the following required sections:

Introduction
Scope
Requirements
Constraints
Safety considerations
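
A minimal sketch of how such a check could work, assuming each section name appears at the start of a line, optionally preceded by markdown `#` markers or numbering such as `3.` or `2.1`. The matching rule and the names `REQUIRED_SECTIONS` and `check_sections` are illustrative assumptions, and the validator in src/validation/ may match headings differently. The return value mirrors the shape of the `/validate` response.

```python
import re
from typing import Dict, Sequence

REQUIRED_SECTIONS = (
    "Introduction",
    "Scope",
    "Requirements",
    "Constraints",
    "Safety considerations",
)


def check_sections(text: str, required: Sequence[str] = REQUIRED_SECTIONS) -> Dict:
    """Report which required section headings appear in `text`."""
    sections = {}
    for name in required:
        # Heading at the start of a line, case-insensitive, with optional
        # '#' markers or numeric prefixes in front of the section name.
        pattern = rf"^\s*(?:#+\s*)?(?:\d+(?:\.\d+)*\.?\s+)?{re.escape(name)}\b"
        found = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        sections[name] = found is not None
    return {
        "all_sections_present": all(sections.values()),
        "sections": sections,
    }
```

Anchoring the match to the start of a line avoids counting a passing mention such as "see the requirements below" as a section heading.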

Model Requirements

Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (default, ~80MB)
LLM: Ollama with a compatible model (e.g., llama3, llama3:8b)

System Requirements

Minimum 4GB RAM (8GB recommended)
Python 3.8+
Ollama installed and configured
Internet connection for initial model downloads

Limitations

PDF parsing quality depends on PDF structure
LLM generation speed depends on hardware and model size
Validation rules are configurable in principle, but the list of required sections is currently fixed
FAISS index is in-memory (not persisted between restarts)
