🚀 RAG Enterprise Chatbot

A production-ready Retrieval-Augmented Generation (RAG) chatbot system for enterprise knowledge management with native Mac deployment, Metal GPU acceleration, and 8-10x performance improvements over Docker.

🎯 Overview

This project implements a complete RAG pipeline that allows users to ask questions about enterprise documents (HR policies, onboarding guides, engineering standards) and receive contextual answers backed by retrieved sources using local LLM inference.

Key Features:

  • End-to-end RAG pipeline with resilient LLM integration
  • Native Mac deployment with Metal GPU acceleration (8-10x faster)
  • One-command automation with comprehensive health checks
  • Vector similarity search with Milvus (Docker standalone)
  • Local LLM inference via Ollama (Mistral 7B, 4.4GB model)
  • State-of-the-art embeddings (BGE-Base-En)
  • Automatic document ingestion with recursive file discovery
  • Full Confluence API integration (basic auth, pagination, CQL search)
  • Conversational memory (last 5 turns per session)
  • Health checks and service monitoring
  • Clean, minimal React UI with hot reload
  • Source attribution and latency tracking
  • 14 sample documents pre-loaded and indexed

⚡ Performance

Native Mac vs Docker:

  • Query time: 8-10 seconds (vs 60-90 seconds in Docker)
  • 8-10x performance improvement using Metal GPU
  • Memory efficient: ~8-10GB total usage
  • No container overhead for LLM inference

🚀 Quick Start (Local Mac Deployment)

Prerequisites

  • macOS (tested on Mac Mini M4 Pro with 48GB RAM)
  • Homebrew installed
  • Python 3.11+ (installed via Homebrew if needed)
  • Node.js 18+ (installed via Homebrew if needed)
  • Docker Desktop (for Milvus only)
  • 8GB RAM minimum (16GB+ recommended)

⚡ One-Command Startup

# Clone the repository
git clone https://github.com/techadarsh/RAG-ENTERPRISE.git
cd rag-enterprise

# Start everything (handles all prerequisites automatically)
./start_local.sh start

What it does:

  1. ✅ Checks and installs prerequisites (Homebrew, Python, Node.js, Ollama, Redis, Docker)
  2. ✅ Starts Ollama service with Metal GPU acceleration
  3. ✅ Downloads Mistral model if not present (4.4GB, one-time)
  4. ✅ Starts Redis for session caching
  5. ✅ Starts Milvus standalone container for vector storage
  6. ✅ Creates Python virtual environment and installs dependencies
  7. ✅ Starts FastAPI backend with hot reload (port 8000)
  8. ✅ Starts React frontend with hot reload (port 3000)
  9. ✅ Loads Confluence documents (via API)
  10. ✅ Performs comprehensive health checks
  11. ✅ Shows service status and access URLs

Expected startup time:

  • First run: 5-8 minutes (model download + dependencies + Confluence sync)
  • Subsequent runs with FORCE_INITIAL_LOAD=true: 2-3 minutes (Confluence documents are loaded before the API accepts requests)
  • Subsequent runs with FORCE_INITIAL_LOAD=false: 30-60 seconds (near-instant startup; documents load in the background)

Startup behavior (configurable):

The backend can start in two modes:

  1. Blocking Load (FORCE_INITIAL_LOAD=true in .env.local):

    • Backend waits to load Confluence documents before accepting requests
    • Startup time: 2-3 minutes
    • Pro: Knowledge base is immediately available for queries
    • Con: Slower startup
    • Best for: Demos, presentations, production deployments
  2. Background Load (FORCE_INITIAL_LOAD=false):

    • Backend starts immediately and loads documents in the background
    • Startup time: 30-60 seconds
    • Pro: Instant API availability
    • Con: First few queries may have limited context until loading completes
    • Best for: Development, testing, quick iterations

To change modes, edit .env.local:

FORCE_INITIAL_LOAD=true   # or false

Current default: FORCE_INITIAL_LOAD=true (blocking load for reliable demo experience)

📱 Access URLs

  • Frontend (React UI): http://localhost:3000
  • Backend API (FastAPI): http://localhost:8000
  • Ollama: http://localhost:11434
  • Milvus: localhost:19530
  • Redis: localhost:6379

🛠️ Available Commands

# Start all services
./start_local.sh start

# Stop all services
./start_local.sh stop

# Check service status
./start_local.sh status

# Restart all services
./start_local.sh restart

# Clean all data and reset
./start_local.sh clean

# View logs for a specific service
./start_local.sh logs backend
./start_local.sh logs frontend
./start_local.sh logs ollama
./start_local.sh logs redis
./start_local.sh logs milvus

# Show help
./start_local.sh help

💡 Sample Queries

Try asking:

  • "What is the PTO policy?"
  • "How many holidays do we get?"
  • "What happens during onboarding week 1?"
  • "Can I rollover unused PTO?"
  • "What are the incident severity levels?"
  • "How do I create a pull request?"
  • "What is our code review process?"
  • "What is the agile workflow?"

Expected response time: 8-10 seconds with Metal GPU acceleration

🔧 Architecture

System Components

┌──────────────────────────────────────────────────────────────┐
│                     User Browser                             │
│                   http://localhost:3000                      │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                 React Frontend (Port 3000)                   │
│              - Hot reload development mode                   │
│              - Clean, minimal UI                             │
│              - Conversation history                          │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│            FastAPI Backend (Port 8000)                       │
│              - RAG pipeline orchestration                    │
│              - Embedding generation (BGE-Base-En)            │
│              - Vector similarity search                      │
│              - LLM query generation                          │
│              - Conversational memory (5 turns)               │
│              - Hot reload with Uvicorn                       │
└─┬───────────────────┬──────────────────┬─────────────────────┘
  │                   │                  │
  │                   │                  │
  ▼                   ▼                  ▼
┌──────────────┐  ┌────────────┐  ┌──────────────┐
│   Ollama     │  │   Milvus   │  │    Redis     │
│  (Native)    │  │  (Docker)  │  │ (Homebrew)   │
│ Port 11434   │  │ Port 19530 │  │  Port 6379   │
│              │  │            │  │              │
│ - Mistral 7B │  │ - Vectors  │  │ - Sessions   │
│ - Metal GPU  │  │ - Metadata │  │ - Cache      │
│ - 4.4GB RAM  │  │ - Search   │  │              │
└──────────────┘  └────────────┘  └──────────────┘

Data Flow

  1. User Query → Frontend sends question to /ask endpoint
  2. Embedding → Backend generates query embedding using BGE-Base-En
  3. Retrieval → Milvus performs vector similarity search
  4. Context → Top relevant documents retrieved with metadata
  5. Generation → LLM generates answer using retrieved context
  6. Response → Answer + sources + latency returned to frontend
  7. Memory → Conversation stored in Redis (last 5 turns)
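
A minimal sketch of that flow in Python, using hypothetical helper names (embed_query, search_milvus, generate_answer, save_turn) in place of the real modules under backend/; the actual signatures may differ:

```python
# Sketch of the query flow above; the helper functions are hypothetical stand-ins
# for embeddings.py, milvus_client.py, llm_client.py and the Redis memory code.
import time

def answer_question(query: str, session_id: str | None = None) -> dict:
    start = time.time()
    vector = embed_query(query)                       # 2. BGE-Base-En embedding
    hits = search_milvus(vector, top_k=3)             # 3. vector similarity search
    context = "\n\n".join(h["text"] for h in hits)    # 4. build context from top hits
    answer = generate_answer(query, context)          # 5. Ollama / Mistral 7B generation
    if session_id:
        save_turn(session_id, query, answer)          # 7. Redis memory, last 5 turns
    return {                                          # 6. response payload
        "answer": answer,
        "sources": [{"title": h["title"], "text": h["text"]} for h in hits],
        "latency_ms": round((time.time() - start) * 1000, 2),
    }
```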

Confluence Integration

Mode: Local (API-ready)

Configuration (.env.local):

CONFLUENCE_MODE=local
CONFLUENCE_LOCAL_DIR=data/sample_confluence_pages

API Implementation (backend/confluence_ingest.py):

  • ✅ Basic authentication (email + API token)
  • ✅ Fetch page by ID
  • ✅ Fetch all pages with pagination
  • ✅ CQL search support
  • ✅ Error handling (401, 404, timeout, connection errors)

To enable API mode:

  1. Update .env.local:
    CONFLUENCE_MODE=api
    CONFLUENCE_BASE_URL=https://your-domain.atlassian.net/wiki
    CONFLUENCE_EMAIL=your-email@company.com
    CONFLUENCE_API_TOKEN=your-api-token
  2. Restart backend: ./start_local.sh restart
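
For reference, a rough sketch of paginated page fetching against the Confluence REST API; the real logic lives in backend/confluence_ingest.py and may differ in detail:

```python
# Paginated page fetch with basic auth (email + API token); illustrative only.
import requests

def fetch_all_pages(base_url: str, email: str, api_token: str, limit: int = 50):
    """Yield every page in the site using start/limit pagination."""
    start = 0
    while True:
        resp = requests.get(
            f"{base_url}/rest/api/content",
            params={"type": "page", "limit": limit, "start": start,
                    "expand": "body.storage"},
            auth=(email, api_token),
            timeout=30,
        )
        resp.raise_for_status()            # surfaces 401/404 instead of failing silently
        results = resp.json().get("results", [])
        yield from results
        if len(results) < limit:           # last page reached
            break
        start += limit
```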

🧪 Testing & Validation

Health Checks

Check all service dependencies:

curl http://localhost:8000/health/deps

Expected response:

{
  "backend": "ok",
  "milvus": "ok",
  "etcd": "ok",
  "minio": "ok",
  "redis": "ok",
  "ollama": "ok",
  "embeddings": "ok"
}

LLM Health Check

Test LLM connectivity and generation:

curl http://localhost:8000/llm/health

Query Testing

Test RAG pipeline with a sample question:

curl -X POST http://localhost:8000/ask \
  -H 'Content-Type: application/json' \
  -d '{"query":"What is the agile workflow?"}'

Expected response time: 8-10 seconds

Service Status

Check individual service status:

./start_local.sh status

Output shows:

  • ✅ Ollama (with model info)
  • ✅ Redis (memory usage)
  • ✅ Milvus (container status)
  • ✅ Backend (process status)
  • ✅ Frontend (process status)
  • 📊 Document count and topics

πŸ› Troubleshooting

Services Not Starting

Check prerequisites:

# The script checks these automatically, but you can verify manually:
which brew       # Should show Homebrew path
which python3    # Should show Python 3.11+
which node       # Should show Node.js 18+
which ollama     # Should show Ollama path
brew services list | grep redis  # Should show redis (started)
docker ps        # Should show Milvus container

Ollama Not Responding

Symptom: Backend fails with "Ollama connection error"

Solution:

# Check if Ollama is running
ps aux | grep ollama

# Restart Ollama
./start_local.sh restart

# Or manually:
brew services restart ollama
ollama serve

Milvus Connection Errors

Symptom: "Failed to connect to Milvus"

Solution:

# Check Milvus container
docker ps | grep milvus

# Check logs
./start_local.sh logs milvus

# Restart Milvus
docker restart milvus-standalone

# If corrupt, clean and restart
./start_local.sh clean
./start_local.sh start

Backend Startup Timeout

Symptom: "Backend failed to start within 120 seconds"

Cause: Topic extraction can take 50-60 seconds on first document load

Solution: This is normal! The script waits up to 120 seconds. If it still fails:

# Check backend logs
./start_local.sh logs backend

# Manually start backend to see errors
cd /path/to/rag-enterprise
source venv/bin/activate
python -m backend.run_local

Frontend Port Already in Use

Symptom: "Port 3000 already in use"

Solution:

# Find and kill the process using port 3000
lsof -ti:3000 | xargs kill -9

# Or restart frontend
./start_local.sh restart

Redis Connection Errors

Symptom: "Could not connect to Redis"

Solution:

# Check Redis status
brew services list | grep redis

# Restart Redis
brew services restart redis

# Test connection
redis-cli ping  # Should return "PONG"

LLM Queries Timing Out

Symptom: Queries take >60 seconds or timeout

Solution:

# Check if Metal GPU is being used
ollama ps

# Check system resources
top -l 1 | grep -E "^CPU|^PhysMem"

# Restart Ollama to clear any issues
brew services restart ollama

Documents Not Loading

Symptom: Health check shows 0 documents

Solution:

# Check if sample documents exist
ls -la data/sample_confluence_pages/

# Manually trigger document loading
curl -X POST http://localhost:8000/ingest/trigger

# Check backend logs for errors
./start_local.sh logs backend

Full Reset

If all else fails, completely reset the system:

# Stop everything
./start_local.sh stop

# Clean all data
./start_local.sh clean

# Start fresh
./start_local.sh start

If the Docker-managed Milvus stack still misbehaves after a reset, inspect and restart it directly:

# Check logs
docker compose logs <service-name> --tail=50

# Full restart
docker compose down && docker compose up -d

πŸ“ Project Structure

rag-enterprise/
├── 📜 start_local.sh               # Main automation script (all-in-one)
├── 📄 README.md                    # This file
├── 📄 QUICKSTART.md                # Quick start guide
├── 📄 QUICK_REFERENCE_LOCAL.md     # Local deployment commands
├── 📄 LOCAL_SETUP_SUCCESS.md       # Detailed local setup documentation
├── 📄 IMPROVEMENT_AREAS.md         # Grey areas and future improvements
├── 📄 DEMO_PREP_CHECKLIST.md       # M.Tech demo preparation
│
├── 📂 backend/                     # FastAPI backend service
│   ├── main.py                    # API endpoints (/health, /ask)
│   ├── run_local.py               # Local deployment script
│   ├── rag_pipeline.py            # RAG workflow orchestration
│   ├── milvus_client.py           # Vector database operations
│   ├── embeddings.py              # Embedding generation (BGE-Base-En)
│   ├── llm_client.py              # LLM integration with Ollama
│   ├── confluence_ingest.py       # Confluence API integration (COMPLETE)
│   ├── requirements.txt           # Python dependencies
│   └── .env.local                 # Local environment config
│
├── 📂 frontend/                    # React frontend service
│   ├── src/
│   │   ├── App.js                # Main chat component
│   │   ├── App.css               # Styling
│   │   └── index.js              # React entry point
│   ├── public/
│   ├── package.json              # Node dependencies
│   └── node_modules/             # Installed dependencies
│
├── 📂 data/                        # Sample documents
│   └── sample_confluence_pages/  # 14 pre-loaded documents
│       ├── agile_workflow.txt
│       ├── api_best_practices.txt
│       ├── code_review.txt
│       ├── engineering_standards.txt
│       ├── hr_policy.txt
│       ├── incident_management.txt
│       ├── leave_policy.txt
│       ├── onboarding.txt
│       ├── performance_review.txt
│       ├── security_guidelines.txt
│       └── ... (14 total)
│
├── 📂 docs_archive/                # Archived reference documentation
│   ├── legacy/                    # 1 file: conversational memory
│   ├── guides/                    # 5 files: architecture, APIs, hot reload
│   └── summaries/                 # 4 files: performance, privacy, design
│
├── 📂 venv/                        # Python virtual environment
├── 📂 volumes/                     # Milvus data persistence
├── .env.local                      # Local environment variables
├── .gitignore                      # Git ignore rules
├── docker-compose.yml              # Docker services (Milvus only)
└── requirements.txt                # Python dependencies

Key Files

  • start_local.sh: Main automation script that handles everything

    • 689 lines of comprehensive automation
    • Prerequisite checking and installation
    • Service orchestration (Ollama, Redis, Milvus, Backend, Frontend)
    • Health monitoring and status reporting
    • Document loading and indexing
    • Logging and debugging support
  • backend/confluence_ingest.py: Full Confluence API implementation

    • ✅ Basic authentication (email + API token)
    • ✅ Fetch page by ID
    • ✅ Fetch all pages with pagination
    • ✅ CQL search support
    • ✅ Comprehensive error handling
  • .env.local: Local deployment configuration

    • All service hostnames (localhost, not Docker internal)
    • LLM timeouts (cold: 90s, warm: 60s)
    • Confluence mode selection (local/api)
    • Milvus standalone configuration

Documentation Organization

Root Documentation (6 files):

  • Essential guides for getting started and running the system
  • Current setup, commands, and improvement areas
  • Demo preparation checklist

Archived Documentation (10 files in docs_archive/):

  • Architecture and design documentation
  • API implementation details
  • Performance optimization strategies
  • Security and privacy documentation
  • Prompt engineering best practices

Purpose: Clean root directory for easy navigation, with valuable reference material preserved in archive.

Architecture

┌─────────────┐
│    User     │
└──────┬──────┘
       │
       ▼
┌─────────────────┐       ┌──────────────┐
│ React Frontend  │◄─────►│   FastAPI    │
│  (Port 3000)    │       │  (Port 8000) │
└─────────────────┘       └──────┬───────┘
                                 │
         ┌────────────┬──────────┼───────────┐
         ▼            ▼          ▼           ▼
    ┌──────────┐  ┌────────┐  ┌──────┐  ┌─────────┐
    │ Embedder │  │ Milvus │  │ LLM  │  │  Redis  │
    │   BGE    │  │ Vector │  │Client│  │  Queue  │
    │ Large-En │  │   DB   │  │      │  │         │
    └──────────┘  └────────┘  └──────┘  └────┬────┘
                                             │
                                   ┌─────────┴─────────┐
                                   │     Ingestion     │
                                   │      Workers      │
                                   └─────────┬─────────┘
                                             │
        ┌────────────────────────────────────┼───────────────────┐
        ▼                                    ▼                   ▼
  ┌──────────┐                       ┌─────────────┐      ┌──────────────┐
  │  Folder  │                       │  S3/MinIO   │      │  Confluence  │
  │  Watcher │                       │  Listener   │      │   Webhook    │
  └──────────┘                       └─────────────┘      └──────────────┘
   Local files                        Bucket events        Page updates


### Request Flow

1. **User Query** → Frontend sends query to backend `/api/query` endpoint
2. **Embedding** → Query is embedded using BGE-Large-En model
3. **Retrieval** → Top-3 similar documents retrieved from Milvus
4. **Context Building** → Retrieved documents combined as context
5. **Generation** → LLM generates answer based on context
6. **Response** → Answer, sources, and latency returned to UI
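
Step 4 amounts to concatenating the retrieved chunks into the prompt sent to the LLM; a small illustration (the actual prompt template in the backend may be worded differently):

```python
# Illustrative context-building step; not the backend's exact prompt template.
def build_prompt(query: str, retrieved_docs: list[dict]) -> str:
    context = "\n\n".join(f"[{d['title']}]\n{d['text']}" for d in retrieved_docs)
    return (
        "Answer the question using only the context below and cite the document "
        "titles you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```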

### Ingestion Flow (Phase 1: Manual API)

1. **Document Upload** → User uploads file to `/api/ingest/upload`
2. **Job Queuing** → Backend saves file and publishes job to Redis
3. **Worker Processing** → Ingestion worker picks up job from queue
4. **Chunking & Embedding** → Worker chunks document and generates embeddings
5. **Storage** → Embeddings and text inserted into Milvus
6. **Status Update** → Job status updated in Redis
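
The chunking step (4) can be as simple as a fixed-size sliding window; a sketch with illustrative sizes, not the worker's exact parameters:

```python
# Fixed-size chunking with overlap; chunk_size/overlap values are illustrative.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```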

### Auto-Trigger Ingestion Flow (Phase 2: New!)

**Three automatic trigger mechanisms:**

#### Folder Watcher
1. User drops file in `data/incoming/` directory
2. Watcher detects new/modified file
3. Job automatically enqueued to Redis
4. Worker processes file → embeds → stores in Milvus

#### S3/MinIO Listener
1. File uploaded to S3/MinIO bucket (`incoming/` prefix)
2. Listener receives bucket notification event
3. File downloaded to temporary location
4. Job automatically enqueued to Redis
5. Worker processes file → embeds → stores in Milvus

#### Confluence Webhook
1. Page created/updated in Confluence
2. Webhook POST sent to `/api/webhook/confluence`
3. Backend extracts page URL
4. URL ingestion job enqueued to Redis
5. Worker fetches content → embeds → stores in Milvus

**Enable auto-triggers:**
```bash
# Set in .env
ENABLE_FOLDER_WATCHER=true
ENABLE_S3_TRIGGER=true

# Start trigger service
docker compose --profile trigger up -d
```

See TRIGGER_SERVICE_GUIDE.md for complete documentation.



## Tech Stack & Why?

### Backend: FastAPI
- **Why?** Async support, automatic API docs, Python ecosystem
- Modern, fast, and perfect for ML/AI services
- Built-in validation with Pydantic

### Vector DB: Milvus
- **Why?** Purpose-built for vector similarity search
- ANN (Approximate Nearest Neighbor) optimization
- Handles billion-scale vectors efficiently
- Open-source and production-ready
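
A minimal pymilvus similarity search against the collection, assuming a vector field named "embedding" and "title"/"text" metadata fields (the field names and metric are assumptions, not the project's exact schema):

```python
# Minimal pymilvus search sketch; field names, metric, and embedder call are assumptions.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("enterprise_docs")        # collection name from the .env example
coll.load()

query_vector = embed_query("What is the PTO policy?")   # hypothetical embedder call

results = coll.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 10}},
    limit=3,
    output_fields=["title", "text"],
)
for hit in results[0]:
    print(hit.score, hit.entity.get("title"))
```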

### Embeddings: BGE-Large-En
- **Why?** State-of-the-art dense retrieval performance
- Top results on MTEB leaderboard for English
- 1024-dimensional embeddings
- Excellent zero-shot generalization
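
Generating such an embedding with sentence-transformers looks roughly like this (the README mentions both BGE-Base-En and BGE-Large-En, so treat the model name as illustrative):

```python
# Encode a query with a BGE model via sentence-transformers; model name is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vector = model.encode("What is the PTO policy?", normalize_embeddings=True)
print(vector.shape)   # 1024 dimensions for the large variant
```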

### LLM: Mistral 7B
- **Why?** Strong performance with efficient inference
- Better quality-to-cost ratio than alternatives
- Supports both mock (demo) and API modes
- Easy to swap with other models

### Frontend: React
- **Why?** Component-based, fast, widely adopted
- Simple for this use case (no complex state management)
- Great developer experience

### Orchestration: Docker Compose
- **Why?** Reproducible one-command deployment
- Multi-service management
- Consistent environments (dev/prod)
- Easy dependency handling

## Configuration

### Environment Variables

Copy `.env.example` to `.env` to customize:

```bash
# Milvus Connection
MILVUS_HOST=milvus
MILVUS_PORT=19530
COLLECTION_NAME=enterprise_docs

# Embeddings
EMBEDDING_MODEL=all-MiniLM-L6-v2
EMBEDDING_DIM=384

# LLM Mode
LLM_MODE=mock                    # Options: mock, mistral

# Mistral API (if LLM_MODE=mistral)
MISTRAL_API_KEY=your_key_here
MISTRAL_API_URL=https://api.mistral.ai/v1/chat/completions

# Data Directory
DATA_DIR=/app/data

# Confluence Integration
CONFLUENCE_MODE=local            # Options: local, api
CONFLUENCE_LOCAL_DIR=/app/data/sample_confluence_pages
CONFLUENCE_BASE_URL=https://yourcompany.atlassian.net/wiki
CONFLUENCE_USER_EMAIL=your.email@company.com
CONFLUENCE_API_TOKEN=your_confluence_api_token
CONFLUENCE_SPACE_KEY=ENGINEERING
```

Confluence Integration (POC + API-Ready)

The system includes modular Confluence integration that works in two modes:

Local Mode (Current POC)

CONFLUENCE_MODE=local
  • Reads sample Confluence pages from data/sample_confluence_pages/
  • Includes realistic enterprise documentation:
    • Engineering Standards (code review, git workflow, testing)
    • Agile Workflow (sprint planning, Jira, retrospectives)
    • Incident Management (severity levels, on-call, playbooks)
    • API Documentation (authentication, endpoints, examples)
  • Perfect for dissertation/demo - no API credentials required
  • Documents are automatically indexed on startup

API Mode (Production-Ready Stub)

CONFLUENCE_MODE=api
CONFLUENCE_BASE_URL=https://yourcompany.atlassian.net/wiki
CONFLUENCE_USER_EMAIL=your.email@company.com
CONFLUENCE_API_TOKEN=your_api_token
CONFLUENCE_SPACE_KEY=ENGINEERING
  • Architecture ready for Confluence REST API integration
  • Stub methods documented with API endpoints and authentication
  • Easy to implement when API access is available
  • Demonstrates enterprise-ready design for dissertation

Why This Approach?

  • Working POC without external dependencies
  • Architecturally sound for production extension
  • Can truthfully claim Confluence integration capability
  • Sample docs demonstrate handling of real enterprise content

LLM Backend Options

The system supports 4 different LLM backends with automatic detection. Choose based on your needs:

1. Mock Mode (Default - Best for Development)

LLM_MODE=mock
  • No dependencies, instant responses
  • Perfect for testing/demos
  • Returns template with context snippets

2. Ollama (Local Inference - Best for Privacy)

LLM_MODE=api
MISTRAL_API_URL=http://host.docker.internal:11434/api/generate
MISTRAL_MODEL=mistral
  • Fast local inference
  • Completely private, no data leaves your machine
  • Free (after initial setup)
  • Requires: Ollama installed

3. HuggingFace Inference API (Cloud - Best for Quick Start)

LLM_MODE=api
MISTRAL_API_URL=https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2
MISTRAL_API_KEY=hf_YOUR_TOKEN_HERE
MISTRAL_MODEL=mistralai/Mistral-7B-Instruct-v0.2
  • No local setup required
  • Free tier available
  • Access to many models
  • Requires: HuggingFace API token

4. Mistral AI Official API (Cloud - Best for Production)

LLM_MODE=api
MISTRAL_API_URL=https://api.mistral.ai/v1/chat/completions
MISTRAL_API_KEY=your_mistral_api_key
MISTRAL_MODEL=mistral-small-latest
  • Enterprise-grade support
  • High performance
  • Requires: Mistral API key (paid)

Backend Auto-Detection: The system automatically detects which backend to use based on the URL pattern:

  • Contains "ollama" or ":11434" → Ollama
  • Contains "huggingface" → HuggingFace
  • Other → Mistral API
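
A sketch of that detection logic (the real version lives in the backend's LLM client and may differ in detail):

```python
# URL-pattern backend detection as described above; illustrative only.
def detect_backend(api_url: str) -> str:
    url = api_url.lower()
    if "ollama" in url or ":11434" in url:
        return "ollama"
    if "huggingface" in url:
        return "huggingface"
    return "mistral"
```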

See LLM_BACKEND_IMPLEMENTATION.md for detailed configuration guide.

API Endpoints

Core Query API

GET /health

Health check endpoint

{
  "status": "ok"
}

POST /api/query

Process a user query with conversational memory

Request:

{
  "query": "What is the PTO policy?",
  "session_id": "user123"  // Optional, for conversation history
}

Response:

{
  "answer": "Based on the HR policies...",
  "sources": [
    {"title": "HR_Policies.txt", "text": "..."}
  ],
  "latency_ms": 1234.56,
  "session_id": "user123"
}
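
One way the 5-turn conversational memory can be kept per session_id is a Redis list trimmed to the most recent entries; the key naming and serialization below are assumptions, not the backend's exact scheme:

```python
# Last-5-turns session memory using a Redis list; key format is an assumption.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_turn(session_id: str, query: str, answer: str, max_turns: int = 5) -> None:
    key = f"session:{session_id}:history"
    r.lpush(key, json.dumps({"query": query, "answer": answer}))
    r.ltrim(key, 0, max_turns - 1)      # keep only the newest max_turns entries

def load_history(session_id: str) -> list[dict]:
    key = f"session:{session_id}:history"
    return [json.loads(item) for item in r.lrange(key, 0, -1)]
```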

Document Ingestion API

The system now supports asynchronous document ingestion via a dedicated microservice. Upload documents through the REST API, and they'll be processed in the background by worker services.

POST /api/ingest/upload

Upload a document for asynchronous ingestion

Request:

curl -X POST http://localhost:8000/api/ingest/upload \
  -F "file=@document.txt"

Response:

{
  "job_id": "abc123-def456-ghi789",
  "status": "queued",
  "message": "Document 'document.txt' queued for ingestion",
  "file_path": "/app/uploads/abc123_document.txt"
}

GET /api/ingest/status/{job_id}

Check the status of an ingestion job

Request:

curl http://localhost:8000/api/ingest/status/abc123-def456-ghi789

Response (Completed):

{
  "job_id": "abc123-def456-ghi789",
  "status": "completed",
  "result": {
    "status": "success",
    "title": "document.txt",
    "chunks": 5,
    "total_characters": 12450,
    "elapsed_seconds": 23.5,
    "message": "[x] Successfully ingested: document.txt"
  }
}

Status Values:

  • queued - Job waiting in queue
  • processing - Worker is processing
  • completed - Successfully ingested
  • failed - Ingestion failed
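
A small client-side sketch that uploads a file and polls the job status until it completes or fails, using the endpoints documented above:

```python
# Upload a document, then poll /api/ingest/status until the job finishes.
import time
import requests

BASE = "http://localhost:8000"

with open("document.txt", "rb") as f:
    job = requests.post(f"{BASE}/api/ingest/upload", files={"file": f}).json()

job_id = job["job_id"]
while True:
    status = requests.get(f"{BASE}/api/ingest/status/{job_id}").json()
    if status["status"] in ("completed", "failed"):
        print(status)
        break
    time.sleep(2)   # avoid hammering the API while the worker runs
```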

DELETE /api/ingest/job/{job_id}

Cancel a pending or running ingestion job

Ingestion Architecture:

User Upload → FastAPI Backend → Redis Queue → Ingestion Worker → Milvus

Key Features:

  • Asynchronous processing (non-blocking)
  • Redis queue for job management
  • Scalable workers (can run multiple)
  • Job status tracking
  • Automatic chunking and embedding
  • Supports .txt and .md files

See INGESTION_API_GUIDE.md for detailed documentation.

Auto-Trigger Ingestion (Phase 2 - NEW!)

Automatically ingest documents without manual API calls. Three trigger mechanisms available:

POST /api/webhook/confluence

Receive Confluence webhook events for automatic page ingestion

Request:

{
  "event": "page_created",
  "page": {
    "id": "12345",
    "title": "Engineering Guidelines",
    "url": "https://yourcompany.atlassian.net/wiki/spaces/ENG/pages/12345"
  }
}

Response:

{
  "status": "success",
  "message": "Confluence page 'Engineering Guidelines' queued for ingestion",
  "job_id": "abc-123-def"
}
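
The handler behind this endpoint has roughly the following shape; the enqueue helper name is hypothetical and the real route lives in the FastAPI backend:

```python
# Rough shape of the Confluence webhook handler; enqueue_url_ingestion is hypothetical.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/api/webhook/confluence")
async def confluence_webhook(request: Request):
    payload = await request.json()
    page = payload.get("page", {})
    job_id = enqueue_url_ingestion(page.get("url"))   # push URL ingestion job to Redis
    return {
        "status": "success",
        "message": f"Confluence page '{page.get('title')}' queued for ingestion",
        "job_id": job_id,
    }
```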

Folder Watcher

Monitor local directory for new files and automatically enqueue for ingestion.

# Enable in .env
ENABLE_FOLDER_WATCHER=true
WATCH_DIR=/app/data/incoming

# Start trigger service
docker compose --profile trigger up -d

# Drop files to auto-ingest
cp document.txt data/incoming/

Supported file types: .txt, .md, .pdf, .doc, .docx
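
A folder watcher of this kind can be built on the watchdog library; a minimal sketch in which enqueue_file stands in for the real Redis job enqueue:

```python
# Minimal watchdog-based folder watcher; enqueue_file is a hypothetical stand-in.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

SUPPORTED = (".txt", ".md", ".pdf", ".doc", ".docx")

class IncomingHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory and event.src_path.endswith(SUPPORTED):
            enqueue_file(event.src_path)        # publish an ingestion job to Redis

observer = Observer()
observer.schedule(IncomingHandler(), path="data/incoming", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
```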

S3/MinIO Listener

Listen to bucket events and automatically ingest uploaded files.

# Enable in .env
ENABLE_S3_TRIGGER=true
MINIO_ENDPOINT=http://minio:9000
S3_BUCKET_NAME=documents

# Start trigger service
docker compose --profile trigger up -d

# Upload to bucket → automatically ingested

Quick Start:

# 1. Enable triggers in .env
echo "ENABLE_FOLDER_WATCHER=true" >> .env

# 2. Start services with trigger profile
docker compose --profile trigger up -d

# 3. Drop a file
echo "Test document" > data/incoming/test.txt

# 4. Watch it get processed
docker compose logs -f trigger

**Complete Guide:** See TRIGGER_SERVICE_GUIDE.md for:

  • Detailed setup instructions
  • Configuration reference
  • Troubleshooting guide
  • Security best practices
  • Testing procedures

POST /ask

**Deprecated:** Use /api/query instead.

Process a user query

Request:

{
  "query": "What is the PTO policy?"
}

Response:

{
  "answer": "Based on the retrieved context...",
  "sources": [
    {
      "title": "leave_policy.txt",
      "text": "Our company uses a combined PTO policy...",
      "score": 0.923
    }
  ],
  "latency_ms": 342.56
}

Development

Adding New Documents

  1. Add .txt files to the data/ directory
  2. Restart backend service:
    docker compose restart backend
  3. Documents are automatically indexed on startup if collection is empty

Running Services Individually

Backend only:

cd backend
pip install -r requirements.txt
python main.py

Frontend only:

cd frontend
npm install
npm start

Viewing Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f backend
docker compose logs -f frontend
docker compose logs -f milvus

Troubleshooting

"Connection refused" error

  • Wait 2-3 minutes for Milvus to fully initialize
  • Check Milvus health: curl http://localhost:9091/healthz

Out of memory during startup

  • Embedding model requires ~4GB RAM
  • Increase Docker memory limit in Docker Desktop settings

Port already in use

  • Stop conflicting services or change ports in docker-compose.yml

Frontend can't reach backend

  • Ensure REACT_APP_API_URL matches your backend URL
  • Check CORS settings in backend/main.py

Evaluation (Phase 3) - Dissertation Metrics

This project includes an automated evaluation module that measures system performance for dissertation reporting.

What Gets Measured

The evaluation script tests 8 predefined queries and measures:

| Metric | Description | Expected Range |
|--------|-------------|----------------|
| Retrieval Time | Embedding generation + vector search latency | 30-100 ms |
| Generation Time | LLM inference time | 1000-2500 ms |
| Total Latency | End-to-end response time | 1200-2800 ms |
| Relevance Score | Cosine similarity of top-ranked source | 70-90% |
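
The relevance score is the cosine similarity between the query embedding and the top-ranked source's embedding, reported as a percentage; for reference:

```python
# Cosine similarity used for the relevance score (reported as a percentage).
import numpy as np

def relevance(query_vec: np.ndarray, source_vec: np.ndarray) -> float:
    cos = float(np.dot(query_vec, source_vec) /
                (np.linalg.norm(query_vec) * np.linalg.norm(source_vec)))
    return cos * 100   # e.g. 0.82 -> 82%
```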

How to Run Evaluation

Option 1: Run Evaluation Script (Recommended)

# Make sure services are running
docker compose up -d

# Run evaluation (takes 2-3 minutes)
docker compose run backend python evaluate_poc.py

# Copy results to your machine
docker compose cp backend:/app/results/results.md ./backend/results/results.md

# View results
cat backend/results/results.md

Option 2: Trigger via API

# Start services
docker compose up -d

# Trigger evaluation via API
curl http://localhost:8000/evaluate

# Or visit in browser
open http://localhost:8000/evaluate

Output Format

The evaluation generates a Markdown file (results.md) with:

  1. Performance Summary Table

    | Metric | Average | Unit |
    |--------|---------|------|
    | Retrieval Time | 45.23 | ms |
    | Generation Time | 1250.67 | ms |
    | Total Latency | 1295.90 | ms |
    | Relevance Score | 82.45% | % |
  2. Detailed Query Results (8 test queries with individual metrics)

  3. Answer Previews (for qualitative analysis)

  4. System Configuration (for methodology section)

Dissertation Use

The results.md file is ready for direct inclusion in your dissertation's Results & Evaluation chapter.

🎓 Academic Context

This project is part of an M.Tech 4th semester presentation demonstrating:

  • ✅ Enterprise RAG Implementation: Complete end-to-end pipeline
  • ✅ Performance Optimization: 8-10x improvement with Metal GPU
  • ✅ Native Deployment: Moving from Docker to optimized local setup
  • ✅ Production Practices: Health checks, monitoring, error handling
  • ✅ API Integration: Full Confluence API implementation
  • ✅ Automation: One-command deployment with comprehensive checks

Performance Metrics

  • Query Response Time: 8-10 seconds (vs 60-90s in Docker)
  • Speedup: 8-10x improvement with Metal GPU acceleration
  • Memory Usage: ~8-10GB total (efficient resource utilization)
  • Document Loading: 14 documents indexed in ~5 seconds
  • Topic Extraction: ~50 seconds (one-time per session)
  • Startup Time: 30-60 seconds (after initial setup)

Technical Achievements

  1. Native Mac Optimization

    • Metal GPU acceleration for LLM inference
    • Eliminated Docker overhead for compute-intensive tasks
    • Optimized service orchestration
  2. Comprehensive Automation

    • 689-line automation script
    • Prerequisite checking and auto-installation
    • Health monitoring and status reporting
    • Intelligent timeout handling
  3. Full API Integration

    • Complete Confluence API implementation
    • Basic authentication support
    • Pagination and search capabilities
    • Comprehensive error handling
  4. Production-Ready Features

    • Conversational memory (5-turn context)
    • Hot reload for development
    • Comprehensive health checks
    • Document ingestion pipeline
    • Source attribution and latency tracking

📚 Additional Documentation

  • QUICK_REFERENCE_LOCAL.md: Quick command reference
  • LOCAL_SETUP_SUCCESS.md: Detailed setup walkthrough
  • IMPROVEMENT_AREAS.md: 18 identified grey areas for future work
  • DEMO_PREP_CHECKLIST.md: M.Tech presentation preparation
  • docs_archive/: Architectural and implementation reference docs

🚀 Future Enhancements

Priority 1 (High Impact)

  • Advanced conversational features (conversation branches, history search)
  • Enhanced document preprocessing (better chunking strategies)
  • Query optimization (caching, compression)

Priority 2 (User Experience)

  • Multi-document upload interface
  • Real-time ingestion status
  • Query history and favorites
  • Export conversations

Priority 3 (Production)

  • User authentication and authorization
  • Multi-tenant support
  • Monitoring and analytics dashboard
  • Automated testing suite

Priority 4 (Advanced Features)

  • Multi-language support
  • Custom embedding models
  • Fine-tuning capabilities
  • Advanced RAG techniques (HyDE, multi-query)

πŸ“ License

This project is for educational/academic purposes (M.Tech dissertation).

πŸ™ Acknowledgments

  • Ollama: Local LLM inference with Metal GPU support
  • Milvus: High-performance vector database
  • FastAPI: Modern Python web framework
  • React: Frontend UI framework

Built with ❤️ for enterprise knowledge management

Last updated: November 19, 2025
Branch: local-final-presentation
Status: Production-ready local deployment
