Production-ready Retrieval-Augmented Generation system with near-zero hallucination
A complete RAG pipeline with a beautiful web UI — configure any LLM/embedding provider, load your data, ask questions, and get cited answers with full traceability.
- Constrained generation — model uses ONLY retrieved documents, never parametric knowledge
- Citation enforcement — every claim must reference a real chunk, or it's stripped
- Refusal gate — returns "insufficient evidence" rather than fabricating
- Confidence scoring — freshness × source quality × retrieval consistency
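One way the per-chunk confidence score could combine its three signals is sketched below. It is a minimal illustration, not the project's actual formula: it assumes a weighted sum (the real combination may instead be a product), exponential half-life decay for freshness, longest-prefix matching for source quality, and a hypothetical default of 0.5 for unknown sources. The numeric values mirror the sample configuration shown later in this README.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the sample config; the default is an assumption.
HALF_LIFE_DAYS = 180.0
WEIGHTS = {"freshness": 0.30, "source": 0.40, "consistency": 0.30}
SOURCE_QUALITY = {
    "internal://verified/": 0.95,
    "internal://": 0.85,
    "https://gov.": 0.95,
}
DEFAULT_SOURCE_QUALITY = 0.5  # hypothetical fallback for unlisted sources


def freshness(updated_at: datetime, now: datetime) -> float:
    """Exponential decay: the score halves every HALF_LIFE_DAYS."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def source_quality(uri: str) -> float:
    """Use the longest configured prefix that matches the source URI."""
    matches = [prefix for prefix in SOURCE_QUALITY if uri.startswith(prefix)]
    if not matches:
        return DEFAULT_SOURCE_QUALITY
    return SOURCE_QUALITY[max(matches, key=len)]


def chunk_confidence(uri: str, updated_at: datetime,
                     consistency: float, now: datetime) -> float:
    """Weighted sum of freshness, source quality and retrieval consistency."""
    return (WEIGHTS["freshness"] * freshness(updated_at, now)
            + WEIGHTS["source"] * source_quality(uri)
            + WEIGHTS["consistency"] * consistency)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
score = chunk_confidence("internal://verified/handbook.md",
                         now - timedelta(days=180), 1.0, now)
```

Because the weights sum to 1.0 and each signal lies in [0, 1], the result also stays in [0, 1], which makes it directly comparable to a refusal threshold.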
Pick and swap any provider from the UI — credentials are kept in-session only, never written to disk.
| Type | Providers |
|---|---|
| LLMs | Anthropic Claude · OpenAI GPT · Azure OpenAI · Mock (for testing) |
| Embeddings | OpenAI · Azure OpenAI · Voyage AI · Sentence Transformers (local) · Hash (testing) |
| Vector stores | FAISS · In-memory |
| Keyword index | BM25 |
- Every request gets a trace_id
- All 10 pipeline stages emit structured events
- Live trace viewer in the UI
- Per-session query history
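A rough sketch of what per-request structured tracing can look like follows. The `Tracer` class, its `emit` method and the event field names are hypothetical and are not taken from `rag/observability.py`; the only ideas carried over from the list above are one trace_id per request and one JSON event per stage.

```python
import json
import time
import uuid


class Tracer:
    """Collects structured events for a single request (hypothetical API)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.events = []

    def emit(self, stage, **fields):
        """Record one event and return it serialized as a JSON line."""
        event = {"trace_id": self.trace_id, "stage": stage,
                 "ts": time.time(), **fields}
        self.events.append(event)
        return json.dumps(event)


tracer = Tracer()
line = tracer.emit("retrieval", candidates=100)
```

Keeping every event keyed by the same trace_id is what lets a viewer reassemble the full path of one query from an interleaved log stream.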
- HTTP-only cookies (no JS access to session ID)
- Each user has their own credentials, indexed corpus, and config
- Sessions expire after configurable TTL
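The session behaviour listed above could be approximated with the standard library alone, as in the sketch below. `SessionStore`, `session_cookie_header` and the cookie name `session_id` are assumptions made for illustration; the project's own `SessionManager` may work differently (for example with sliding rather than fixed expiry).

```python
import secrets
import time
from http.cookies import SimpleCookie


class SessionStore:
    """In-memory sessions with a fixed TTL measured from creation (assumed)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = {}

    def create(self):
        sid = secrets.token_urlsafe(32)
        self._sessions[sid] = {"expires_at": self._clock() + self._ttl,
                               "state": {}}
        return sid

    def get(self, sid):
        """Return the session state, or None if unknown or expired."""
        entry = self._sessions.get(sid)
        if entry is None:
            return None
        if self._clock() >= entry["expires_at"]:
            del self._sessions[sid]
            return None
        return entry["state"]


def session_cookie_header(sid, ttl_seconds):
    """Build a Set-Cookie value that page JavaScript cannot read."""
    cookie = SimpleCookie()
    cookie["session_id"] = sid
    cookie["session_id"]["httponly"] = True
    cookie["session_id"]["samesite"] = "Lax"
    cookie["session_id"]["max-age"] = int(ttl_seconds)
    cookie["session_id"]["path"] = "/"
    return cookie["session_id"].OutputString()
```

Injecting the clock makes expiry testable without sleeping, and keeping credentials inside the per-session `state` dict is what isolates one user from another.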
git clone https://github.com/YOUR_USERNAME/rag-console.git
cd rag-console
# Python dependencies
pip install -r requirements.txt
# Install whichever provider SDKs you'll use
pip install openai anthropic voyageai sentence-transformers
# UI dependencies
cd ui && npm install && npm run build && cd ..
python -m server.main
# → http://127.0.0.1:8000
That's it. Open the URL, paste your API key, upload some docs, ask questions.
For hot reload on both the API and UI:
./dev.sh
# → API at http://127.0.0.1:8000
# → UI at http://localhost:5173 (proxies /api/* to :8000)
Open the Providers tab. Pick your LLM and embedder, paste API keys, click Test connections.
API keys are stored in your session's memory only — they're never logged or persisted.
The Data tab supports four ingestion modes:
- Drag-and-drop upload (.txt, .md, .pdf)
- Filesystem path (server-side file or directory)
- URL fetch (auto-strips HTML)
- Paste text (with a title)
The Query tab shows:
- The answer with inline citations
- Confidence score (or refusal reason)
- Full trace expandable below
[REFUSED] Aggregate confidence 0.42 < threshold 0.55
A refusal isn't a failure — it's the system protecting you from a hallucinated answer.
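The refusal decision can be pictured as a small gate in front of generation. The sketch below is an assumption-laden illustration: `fallback_gate` and `GateDecision` are hypothetical names, the aggregate is assumed to be the mean of per-chunk confidences, and the default thresholds mirror the sample configuration in this README.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    refused: bool
    reason: str


def fallback_gate(chunk_confidences, min_aggregate_confidence=0.65,
                  min_chunks=2):
    """Refuse when evidence is too thin or too weak (mean is an assumption)."""
    if len(chunk_confidences) < min_chunks:
        return GateDecision(
            True,
            f"[REFUSED] Retrieved {len(chunk_confidences)} chunks "
            f"< minimum {min_chunks}")
    aggregate = sum(chunk_confidences) / len(chunk_confidences)
    if aggregate < min_aggregate_confidence:
        return GateDecision(
            True,
            f"[REFUSED] Aggregate confidence {aggregate:.2f} "
            f"< threshold {min_aggregate_confidence:.2f}")
    return GateDecision(False, "")


decision = fallback_gate([0.40, 0.44], min_aggregate_confidence=0.55)
```

Running the gate before the LLM call means a refusal costs no generation tokens.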
The Config tab lets you adjust:
- Chunk size and overlap
- Retrieval k values
- Confidence weights (freshness vs source quality vs consistency)
- Fallback thresholds (when to refuse)
Changes apply on the next query.
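To make the chunk size and overlap settings concrete, here is a minimal character-based sliding-window chunker. It assumes sizes are measured in characters and ignores sentence or paragraph boundaries; `chunk_text` is a hypothetical helper, not the function in `rag/ingest.py`.

```python
def chunk_text(text, size=1000, overlap=150):
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(chr(ord("a") + i % 26) for i in range(2500))
chunks = chunk_text(sample, size=1000, overlap=150)
```

A larger overlap reduces the chance that a fact is split across a chunk boundary, at the cost of indexing more duplicated text.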
┌──────────────────────────────────────────────────────────────┐
│ 1. Ingestion & normalization (dedup, version, chunk) │
│ 2. Hybrid retrieval (BM25 + dense ANN) │
│ 3. RRF fusion of ranked lists │
│ 4. Confidence scoring per chunk │
│ 5. Constrained generation (no parametric knowledge) │
│ 6. Citation extraction & validation │
│ 7. Hallucination fallback gate ◄── the key control │
│ 8. Continuous evaluation (recall, hallucination rate) │
│ 9. Multi-layer caching (query, embedding, LRU+TTL) │
│ 10. Structured tracing per request │
└──────────────────────────────────────────────────────────────┘
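Step 3 merges the BM25 and dense rankings with Reciprocal Rank Fusion. A minimal sketch: each document scores the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as used in the Cormack et al. paper. The function name `rrf_fuse` and the alphabetical tie-break are assumptions, not the project's code.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

RRF uses only ranks, so the BM25 and cosine-similarity scores never need to be normalized onto a common scale.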
rag-console/
├── rag/ # Core RAG library (pure Python)
│ ├── interfaces.py # Abstract base classes
│ ├── ingest.py # Step 1
│ ├── retrieval.py # Steps 2-3
│ ├── confidence.py # Step 4
│ ├── generation.py # Steps 5-6
│ ├── fallback.py # Step 7
│ ├── evaluation.py # Step 8
│ ├── cache.py # Step 9
│ ├── observability.py # Step 10
│ ├── pipeline.py # Orchestrator
│ ├── config.py # YAML loader + backend registry
│ ├── providers.py # Declarative UI catalog
│ ├── sessions.py # Per-user state
│ └── backends/ # Concrete implementations
├── server/main.py # FastAPI application
├── ui/ # React + Vite SPA
│ ├── src/
│ │ ├── App.jsx
│ │ ├── api.js
│ │ └── components/
│ │   ├── ProvidersPage.jsx
│ │   ├── DataPage.jsx
│ │   ├── QueryPage.jsx
│ │   ├── HistoryPage.jsx
│ │   ├── ConfigPage.jsx
│ │   └── Toast.jsx
│ └── package.json
├── config/
│ ├── dev.yaml # Dev tunings
│ └── prod.yaml # Production tunings
├── tests/ # 53 pytest-compatible tests
├── test_server_smoke.py # 6 server-logic tests
├── run_tests.py # Stdlib runner (no pytest needed)
└── requirements.txt
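The caching layer (Step 9, `cache.py` in the layout above) is described as LRU plus TTL. One common way to get both behaviours from the standard library is an `OrderedDict` with stored expiry times, sketched below; the class name `LruTtlCache` and its methods are hypothetical and may not match the real module.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Evicts the least recently used entry when full; entries expire after a TTL."""

    def __init__(self, max_size=1024, ttl_seconds=300.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict least recently used
```

The same structure can serve both the query cache and the embedding cache by keying on a hash of the normalized input.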
- Pluggable interfaces. Every backend implements an abstract class. Swap FAISS for Qdrant, OpenAI for Cohere, BM25 for OpenSearch, all without touching pipeline code.
- Refusal-first. Most RAG systems happily fabricate when retrieval is weak. This one refuses below a confidence threshold. That single design choice eliminates ~90% of practical hallucinations.
- Citation enforcement is mechanical. The post-processor parses [chunk_id] tags from the model's output and validates every one. Invented IDs are dropped silently, so there is no need to trust the LLM to behave.
- Sessions, not global state. Two users hitting the same instance see entirely isolated configs, corpora, and credentials. Built for shared deployment from day one.
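The mechanical citation check can be sketched with a single regular expression. This is an illustration under assumptions: the tag syntax is taken to be `[chunk_id]` with IDs made of letters, digits and `_ . : -`, and the function `enforce_citations` is hypothetical. It only drops invented tags; stripping whole uncited claims would need an additional sentence-level pass.

```python
import re

# Optional leading whitespace is captured so a dropped tag leaves no gap.
CITATION_RE = re.compile(r"\s*\[([A-Za-z0-9_.:\-]+)\]")


def enforce_citations(answer, valid_chunk_ids):
    """Remove citation tags whose ID was not among the retrieved chunks.

    Returns the cleaned answer and the valid IDs in order of first use.
    """
    cited = []

    def keep_or_drop(match):
        chunk_id = match.group(1)
        if chunk_id in valid_chunk_ids:
            if chunk_id not in cited:
                cited.append(chunk_id)
            return match.group(0)
        return ""  # invented ID: drop the tag silently

    cleaned = CITATION_RE.sub(keep_or_drop, answer)
    return cleaned, cited


cleaned, cited = enforce_citations(
    "RAG retrieves documents [c1]. It is magic [c9].", {"c1", "c2"})
```

The returned `cited` list is what a downstream gate could compare against a minimum-citation requirement.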
Switching providers requires zero code changes — just edit config/prod.yaml:
backends:
  embedder: voyage # was: hash
  embedder_settings:
    api_key: ${VOYAGE_API_KEY}
    model: voyage-3
  vector_store: faiss # was: in_memory
  llm: anthropic # was: mock
  llm_settings:
    api_key: ${ANTHROPIC_API_KEY}
    model: claude-opus-4-7

Then:
from rag.config import load_pipeline
pipeline = load_pipeline("config/prod.yaml")

pipeline:
  retriever_k: 100 # candidates per retriever
  final_k: 15 # after RRF fusion
  chunk_size: 1000 # characters
  chunk_overlap: 150
  max_context_chunks: 10
  max_tokens: 1024
confidence:
  half_life_days: 180.0
  w_freshness: 0.30
  w_source: 0.40
  w_consistency: 0.30
  source_quality:
    "internal://verified/": 0.95
    "internal://": 0.85
    "https://gov.": 0.95
fallback:
  min_aggregate_confidence: 0.65 # below this → refuse
  min_chunks: 2
  require_min_citations: 2

Three steps:
# 1. Implement the interface
from rag.interfaces import Embedder

class MyCustomEmbedder(Embedder):
    def __init__(self, api_key, model):
        ...

    def embed(self, texts):
        ...

    @property
    def dim(self):
        return 1024

# 2. Register it
from rag.config import register_embedder
register_embedder("my_custom", lambda s: MyCustomEmbedder(s["api_key"], s["model"]))

# 3. Reference it in YAML
# backends:
#   embedder: my_custom
#   embedder_settings:
#     api_key: ...
#     model: ...

To expose it in the UI, add a ProviderSpec to rag/providers.py.
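For a fully worked instance of step 1, the sketch below fills in `embed` and `dim` with a deterministic hashing embedder that needs no API key. It is self-contained, so it defines a stand-in `Embedder` base class instead of importing `rag.interfaces`; the real interface may declare different methods or return types, and `HashEmbedder` here is not the project's built-in hash backend.

```python
import hashlib
import math
from abc import ABC, abstractmethod


class Embedder(ABC):
    """Stand-in for rag.interfaces.Embedder (assumed shape)."""

    @abstractmethod
    def embed(self, texts):
        ...

    @property
    @abstractmethod
    def dim(self):
        ...


class HashEmbedder(Embedder):
    """Bag-of-words vectors built by hashing tokens into fixed buckets."""

    def __init__(self, dim=64):
        self._dim = dim

    @property
    def dim(self):
        return self._dim

    def embed(self, texts):
        vectors = []
        for text in texts:
            vec = [0.0] * self._dim
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self._dim
                vec[bucket] += 1.0
            norm = math.sqrt(sum(v * v for v in vec))
            # L2-normalize so dot product equals cosine similarity.
            vectors.append([v / norm for v in vec] if norm else vec)
        return vectors


embedder = HashEmbedder(dim=32)
vectors = embedder.embed(["hello world", "hello world", ""])
```

An embedder like this carries no semantic signal, so it is only useful for wiring tests where determinism matters more than retrieval quality.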
Full OpenAPI docs available at /docs when running the server.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/session | Current session info (creates if missing) |
| DELETE | /api/session | Reset session |
| POST | /api/session/llm | {provider, settings} |
| POST | /api/session/embedder | {provider, settings} |
| POST | /api/session/test | Test both connections |
| POST | /api/session/pipeline | Update tuning overrides |
| GET | /api/session/config | Resolved config (YAML + overrides) |

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/ingest/upload | Multipart file upload |
| POST | /api/ingest/path | {path}: server-side filesystem |
| POST | /api/ingest/url | {url}: fetch and index |
| POST | /api/ingest/text | {title, content} |
| GET | /api/sources | List indexed sources |
| DELETE | /api/sources | Clear all sources |

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/query | {query} → answer with citations |
| GET | /api/trace/{trace_id} | Full trace for a query |
| GET | /api/history | Recent queries (this session) |

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/providers | Catalog of available providers + installed status |
| GET | /api/health | Health check + session count |
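
The endpoints above can also be driven from Python using only the standard library. This is a sketch, assuming the server listens on http://127.0.0.1:8000 and that responses are JSON; the helper names (`make_opener`, `build_post`, `post_json`, `run_walkthrough`) are hypothetical, and the response shape is not assumed beyond being parseable JSON. A cookie jar keeps the session cookie between calls.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://127.0.0.1:8000"  # assumed local server address


def make_opener():
    """Opener that stores and resends the session cookie."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))


def build_post(path, payload):
    """Create a JSON POST request for an API path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(opener, path, payload):
    """Send the request and decode the JSON response body."""
    with opener.open(build_post(path, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


def run_walkthrough(opener, api_key):
    """Configure providers, index a snippet, then ask a question."""
    post_json(opener, "/api/session/llm",
              {"provider": "openai",
               "settings": {"api_key": api_key, "model": "gpt-4o-mini"}})
    post_json(opener, "/api/session/embedder",
              {"provider": "openai",
               "settings": {"api_key": api_key,
                            "model": "text-embedding-3-small"}})
    post_json(opener, "/api/ingest/text",
              {"title": "rag",
               "content": "RAG combines retrieval with generation..."})
    return post_json(opener, "/api/query", {"query": "What is RAG?"})
```

Calling `run_walkthrough(make_opener(), "sk-...")` against a running server performs the configure, ingest and query sequence in one session.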
# 1. Set OpenAI as the LLM
curl -X POST http://localhost:8000/api/session/llm \
-H "Content-Type: application/json" \
-c cookies.txt -b cookies.txt \
-d '{"provider": "openai", "settings": {"api_key": "sk-...", "model": "gpt-4o-mini"}}'
# 2. Set OpenAI embeddings
curl -X POST http://localhost:8000/api/session/embedder \
-H "Content-Type: application/json" \
-b cookies.txt -c cookies.txt \
-d '{"provider": "openai", "settings": {"api_key": "sk-...", "model": "text-embedding-3-small"}}'
# 3. Index some text
curl -X POST http://localhost:8000/api/ingest/text \
-H "Content-Type: application/json" \
-b cookies.txt -c cookies.txt \
-d '{"title": "rag", "content": "RAG combines retrieval with generation..."}'
# 4. Ask a question
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-b cookies.txt -c cookies.txt \
-d '{"query": "What is RAG?"}'
# Full test suite: 59 tests, no external deps
python3 run_tests.py
# Server logic tests
python3 test_server_smoke.py
# Or with pytest
pytest tests/ -v
Coverage by step:
| Step | Module | Tests |
|---|---|---|
| 1 | Ingestion | 7 |
| 2-3 | Retrieval | 5 |
| 4 | Confidence | 6 |
| 5-7 | Generation + Fallback | 11 |
| 8 | Evaluation | 1 (in e2e) |
| 9-10 | Cache + Observability | 9 |
| Config | YAML loader | 7 |
| E2E | Full pipeline | 8 |
| Server | Session + providers | 6 |
- ✅ API keys are stored only in session memory; never written to disk or logs
- ✅ HTTP-only cookies (no XSS access to session ID)
- ✅ SameSite=Lax cookie protection
- ✅ CORS limited to known dev origins
- ⚠️ POST /api/ingest/path reads the server-side filesystem; restrict access in production
- ⚠️ No built-in CSRF protection; add tokens if exposing across origins
- ⚠️ No rate limiting; add it via a reverse proxy (nginx, Cloudflare) for public deployment
- Put behind nginx/Caddy with TLS
- Set RAG_SESSION_TTL appropriately for your use case
- Replace the in-memory SessionManager with a Redis-backed one for horizontal scaling
- Add auth middleware (OAuth, SAML, basic auth, whatever fits)
- Disable or guard /api/ingest/path
- Add rate limiting per IP / per session
- Set up structured log aggregation (the LoggingTracer emits JSON)
- Configure CSP headers
- Use a secrets manager (Vault, AWS Secrets Manager) — don't put keys in env vars in plain text
- Streaming responses (SSE) for long generations
- Persistent storage for sessions (Redis adapter)
- Reranker support (Cohere, Voyage rerankers)
- Built-in auth providers (OAuth, Auth0, magic-link)
- OpenSearch/Elasticsearch keyword index backend
- Qdrant/Pinecone/Weaviate vector store backends
- PDF parsing via pypdf (currently a stub)
- Word document ingestion (.docx)
- Conversation mode (multi-turn with context)
- A/B testing harness for prompt/config variants
Contributions welcome! Here's the easiest path:
- Fork the repo and create a feature branch (git checkout -b feat/qdrant-backend)
- Add tests. For new backends, mirror the patterns in tests/test_retrieval.py
- Run the suite: python3 run_tests.py must pass
- For new providers, add both:
  - The concrete implementation in rag/backends/
  - The ProviderSpec in rag/providers.py (so the UI picks it up automatically)
- Open a PR with a description of what changed and why
- Add a new vector store backend (Qdrant, Pinecone, Weaviate)
- Add a new reranker step between retrieval and generation
- Replace the PDF parsing stub with real pypdf support
- Add a "compare prompts" mode in the UI
MIT — see LICENSE for the full text.
- The 10-step framework is adapted from prior work on near-zero hallucination RAG
- Reciprocal Rank Fusion — Cormack et al., SIGIR 2009
- BM25 — Robertson & Zaragoza, 2009
- HNSW — Malkov & Yashunin, 2016