Hybrid Search RAG Engine

Production-grade Retrieval-Augmented Generation API for long-document QA. Combines FAISS dense search and BM25 keyword search fused with Reciprocal Rank Fusion, powered by GPT-4o.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        INGESTION PIPELINE                               │
│                                                                         │
│  PDF/DOCX/TXT ──► loader.py ──► chunker.py ──► embedder.py             │
│                   (PageRecord)  (SemanticChunker  (OpenAI               │
│                                  95th-pct split)   text-emb-3-small)    │
│                                       │                                 │
│                          ┌────────────┴────────────┐                   │
│                          ▼                         ▼                   │
│                   vector_store.py           bm25_store.py              │
│                   (FAISS IndexFlatIP)        (BM25Okapi)               │
│                   data/vector_store/         data/bm25_store/          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                         QUERY PIPELINE                                  │
│                                                                         │
│  Question ──► get_query_embedding()                                     │
│                    │ (one API call)                                      │
│         ┌──────────┴──────────┐                                         │
│         ▼                     ▼                                         │
│   VectorStore             BM25Store                                     │
│   .search(k=10)           .search(k=10)                                 │
│   cosine ANN              token TF-IDF                                  │
│         └──────────┬──────────┘                                         │
│                    ▼                                                     │
│          reciprocal_rank_fusion()                                        │
│          score(d) = Σ 1/(rank_i(d) + 60)                               │
│                    │                                                     │
│                    ▼                                                     │
│            top-5 fused chunks                                            │
│                    │                                                     │
│                    ▼                                                     │
│           RAGGenerator.generate()                                        │
│           GPT-4o · temp=0 · strict citation prompt                     │
│                    │                                                     │
│                    ▼                                                     │
│   {"answer": "...", "sources": [...], "confidence_score": 0.94}        │
└─────────────────────────────────────────────────────────────────────────┘

Tech Stack

Layer	Library	Version
Embedding	OpenAI text-embedding-3-small	via openai 1.75.0
Generation	GPT-4o (temp=0)	via langchain-openai 1.1.11
Dense index	FAISS IndexFlatIP / IVFFlat	faiss-cpu 1.9.0
Sparse index	BM25Okapi (k1=1.5, b=0.75)	rank-bm25 0.2.2
Chunking	SemanticChunker (95th-pct)	langchain-experimental 0.3.4
Chain	LCEL RunnableSequence	langchain-core 1.2.18
API	FastAPI + uvicorn	0.115.14 / 0.34.3
Validation	Pydantic v2	2.11.4
Retry	tenacity	9.1.4

Quickstart

1. Clone and install

git clone <repo>
cd rag-engine
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2. Configure

cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...

3. Run

uvicorn app.main:app --reload --port 8000

OpenAPI docs at http://localhost:8000/docs.

4. Docker (recommended for production)

cp .env.example .env        # fill in OPENAI_API_KEY
docker compose up --build -d
docker compose logs -f
docker compose down         # stop (data volume preserved)
docker compose down -v      # stop + wipe all index data

API Reference

POST /ingest

Upload a PDF, DOCX, or TXT file. Additive — each call accumulates into the same corpus without replacing prior documents.

curl -X POST http://localhost:8000/ingest \
  -F "file=@annual_report.pdf"

{
  "doc_id": "annual_report.pdf",
  "filename": "annual_report.pdf",
  "pages_loaded": 47,
  "chunks_indexed": 112,
  "message": "'annual_report.pdf' ingested successfully. 47 pages → 112 chunks indexed."
}

POST /query

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was the EBITDA margin in Q3 2023?"}'

{
  "question": "What was the EBITDA margin in Q3 2023?",
  "answer": "The EBITDA margin in Q3 2023 was 18.5%, driven by operational efficiency improvements [Source: annual_report.pdf, page 15].",
  "sources": [
    {"filename": "annual_report.pdf", "page_num": 15, "chunk_id": "annual_report.pdf::p15::c2"}
  ],
  "confidence_score": 0.94,
  "can_answer": true,
  "model": "gpt-4o",
  "retrieved_chunks": []
}

Include retrieved chunks for debugging retrieval quality:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was revenue?", "top_n": 5, "include_chunks": true}'

When the question is outside the corpus:

{
  "answer": "I cannot find this in the provided documents",
  "sources": [],
  "can_answer": false
}

GET /health

curl http://localhost:8000/health

{
  "status": "ok",
  "version": "1.0.0",
  "vector_store_size": 1124,
  "bm25_store_size": 1124,
  "timestamp": "2024-01-15T14:32:07.114"
}

status is "degraded" when no documents have been ingested or when the two index sizes are out of sync.

Benchmark Results

Hit Rate@5 on a 20-query synthetic corpus (30 chunks, 5 topics):

────────────────────────────────────────────────────────────────────────────────
                       RETRIEVAL BENCHMARK — HIT RATE @ 5
────────────────────────────────────────────────────────────────────────────────
  HYBRID | BM25 | QUERY
────────────────────────────────────────────────────────────────────────────────
  HIT  | BM25=HIT  | Q3 2023 revenue 4.2 billion dollars           finance
  HIT  | BM25=HIT  | EBITDA margin 18.5 percent                    finance
  HIT  | BM25=HIT  | CRISPR-Cas9 genomic editing                   bio
  HIT  | BM25=HIT  | AlphaFold protein structure prediction        bio
  HIT  | BM25=HIT  | GDPR Article 17 right to erasure              legal
  HIT  | BM25=HIT  | Kubernetes microservices orchestration        tech
  HIT  | BM25=HIT  | CUDA GPU matrix multiplication                tech
  HIT  | BM25=HIT  | Gutenberg printing press 1440                 history
  HIT  | BM25=HIT  | messenger RNA spike protein antibodies        bio
  HIT  | BM25=HIT  | force majeure contractual impossibility       legal
  HIT  | BM25=MISS | How did company earnings perform last quarter finance  ← hybrid wins
  HIT  | BM25=HIT  | profitability efficiency measured by EBITDA   finance
  HIT  | BM25=HIT  | CRISPR bacterial gene editing technology      bio
  HIT  | BM25=HIT  | Predicting how proteins fold in 3D            bio
  HIT  | BM25=HIT  | Personal data deletion rights (European law)  legal
  HIT  | BM25=HIT  | Container scheduling across cloud infra       tech
  HIT  | BM25=HIT  | Parallel computation on graphics hardware     tech
  HIT  | BM25=HIT  | How knowledge spread before the internet      history
  MISS | BM25=MISS | Vaccine technology using genetic instructions  bio
  HIT  | BM25=HIT  | Contract clause excusing performance          legal
────────────────────────────────────────────────────────────────────────────────
  Hybrid=95% (19/20)   BM25-only=90% (18/20)   FAISS-only=85% (17/20)
  Hybrid gain: +1 vs BM25, +2 vs FAISS
────────────────────────────────────────────────────────────────────────────────

The hybrid-exclusive hit ("How did company earnings perform last quarter" → revenue chunk) demonstrates the RRF value: FAISS finds it via semantic similarity while BM25 misses due to zero lexical overlap with "Q3 revenue".

Run the benchmark:

pytest tests/test_retrieval.py -v -s

Project Structure

rag-engine/
├── app/
│   ├── main.py                 # FastAPI app: /ingest, /query, /health
│   ├── schemas.py              # Pydantic v2 request/response models
│   ├── ingestion/
│   │   ├── loader.py           # PDF / DOCX / TXT → PageRecord list
│   │   ├── chunker.py          # SemanticChunker → ChunkRecord list
│   │   └── embedder.py         # Batch embed → EmbeddedChunk list
│   ├── retrieval/
│   │   ├── vector_store.py     # FAISS index (build, search, save, load)
│   │   ├── bm25_store.py       # BM25Okapi index (build, search, save, load)
│   │   └── hybrid.py           # RRF fusion + HybridRetriever
│   └── generation/
│       └── llm.py              # LCEL chain: prompt | GPT-4o | citation parser
├── data/
│   ├── vector_store/           # faiss.index + chunks.pkl (auto-created)
│   └── bm25_store/             # bm25.pkl (auto-created)
├── tests/
│   └── test_retrieval.py       # HR@5 benchmark: hybrid vs BM25 vs FAISS
├── Dockerfile                  # Multi-stage build (builder + runtime)
├── docker-compose.yml          # Single-service compose with named volume
├── .env.example                # Config template
├── requirements.txt            # Pinned dependencies
└── README.md

Configuration Reference

Variable	Default	Description
`OPENAI_API_KEY`	—	Required. OpenAI API key
`GENERATION_MODEL`	`gpt-4o`	LLM for answer generation
`GENERATION_TEMPERATURE`	`0`	Sampling temperature
`GENERATION_MAX_TOKENS`	`1024`	Max tokens in generated answer
`GENERATION_TIMEOUT`	`60`	OpenAI request timeout (seconds)
`VECTOR_STORE_DIR`	`data/vector_store`	FAISS persistence path
`BM25_STORE_DIR`	`data/bm25_store`	BM25 persistence path
`HYBRID_K`	`10`	Candidates per retriever
`HYBRID_TOP_N`	`5`	Final chunks passed to LLM
`HYBRID_RRF_K`	`60`	RRF smoothing constant
`BM25_K1`	`1.5`	BM25 term frequency saturation
`BM25_B`	`0.75`	BM25 document length normalisation
`SEMANTIC_THRESHOLD_PERCENTILE`	`95`	Cosine distance percentile for splits
`LOG_LEVEL`	`info`	debug / info / warning / error
`APP_VERSION`	`1.0.0`	Version string in /health

Design Notes

Why --workers 1? FAISS's C++ index is not fork-safe. Scale horizontally with multiple containers behind a load balancer instead of multiple workers per container.

Why semantic chunking? Fixed 512-token windows slice sentences mid-thought. SemanticChunker detects topic-shift boundaries via cosine distance spikes, producing one-complete-idea chunks. The LLM receives coherent passages, not sentence fragments.

Why BM25 alongside FAISS? Dense embeddings smear rare tokens — "EBITDA", "CRISPR-Cas9", "Article 17" map to broad semantic regions shared by adjacent-but-wrong terms. BM25's exact token matching catches them precisely. RRF promotes chunks both retrievers agree on, the strongest multi-evidence signal for relevance.

Why the sentinel string? Calibrating a confidence threshold requires labelled queries. The sentinel "I cannot find this in the provided documents" is model-enforced via the system prompt — deterministic, testable, and requires no per-deployment calibration.

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hybrid Search RAG Engine

Architecture

Tech Stack

Quickstart

1. Clone and install

2. Configure

3. Run

4. Docker (recommended for production)

API Reference

POST /ingest

POST /query

GET /health

Benchmark Results

Project Structure

Configuration Reference

Design Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
app		app
tests		tests
.env		.env
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Hybrid Search RAG Engine

Architecture

Tech Stack

Quickstart

1. Clone and install

2. Configure

3. Run

4. Docker (recommended for production)

API Reference

POST /ingest

POST /query

GET /health

Benchmark Results

Project Structure

Configuration Reference

Design Notes

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages