DocuMind is a local-first RAG stack: two Chroma collections (public, the default, and papers for PDFs/DOCX/text/arXiv), a FastAPI service that selects the library per request, chunking with cosine-distance retrieval, and citations in responses. Inference defaults to Ollama on your hardware. Bulk public text is indexed offline; the web UI is a status and query client over the same REST API.
This document covers architecture, configuration, API behavior, and deployment for handoff and extension.
- System overview
- Design principles
- Repository layout
- Runtime architecture
- Data lifecycle: ingest → index
- Retrieval and generation pipeline
- Query modes
- Retrieval strategies and ablation
- HTTP API
- Configuration
- Security middleware
- Observability and reliability
- Deployment
- Bundled corpus and scripts
- Testing
- Known limitations and extension points
- Portfolio artifacts
- References
- Delivery scope, limits, and scale
- Scale runbook (capacity + cloud migration) — §9 cloud stages and service mapping.
| Layer | Responsibility |
|---|---|
| Presentation | Next.js 15 (web/): status, diagnostics, and query UI against the REST API. Optional Streamlit (frontend/app.py). |
| Application | FastAPI application (app/main.py): routing, middleware, dependency injection, lifespan-managed singletons. |
| Domain services | Document parsing and chunking (app/services/document_service.py, app/utils/chunker.py); vector persistence (app/services/embedding_service.py); RAG orchestration (app/services/rag_service.py). |
| Model I/O | Ollama client (app/utils/ollama_client.py): chat completions and per-text embeddings over HTTP. |
| Persistence | Chroma persistent client on disk (CHROMA_PERSIST_DIR); two collections — CHROMA_COLLECTION_PUBLIC (encyclopedia-scale) and CHROMA_COLLECTION_NAME (papers / PDFs / arXiv; legacy env name for the non-public index). Cosine space (hnsw:space: cosine). |
Ports (convention): API 8001, Next.js dev 3002, Ollama 11434.
- Grounding first: Final user-facing answers for LLM-backed modes are conditioned only on retrieved chunk text; prompts explicitly forbid inventing papers, metrics, or datasets absent from context.
- Explicit provenance: Responses include SourceCitation objects (document id, title, section, page hint, chunk index, distance, preview).
- Dependency-aware serving: Liveness vs readiness split so orchestrators can distinguish “process up” from “dependencies usable”.
- Configurable retrieval policy: Top‑k, distance cutoff, keyword rerank weight, fallback when strict filtering returns nothing, and diversity caps are all environment-tunable.
- Single-tenant baseline: One shared library index per deployment; ACLs per document are not implemented in-tree (see §16).
| Path | Role |
|---|---|
| app/main.py | FastAPI app, lifespan, global exception handler, middleware, router includes. |
| app/config.py | pydantic-settings Settings; single cached get_settings(). |
| app/logging_config.py | Optional JSON logging layout. |
| app/routers/ingest.py | Multipart ingest, delete by doc_id. |
| app/routers/papers.py | List / get / delete paper metadata from index. |
| app/routers/query.py | RAG query and collection stats. |
| app/routers/arxiv.py | arXiv PDF fetch by id. |
| app/services/document_service.py | File type detection, text extraction, delegation to chunker. |
| app/services/embedding_service.py | Chroma add/query/delete; Ollama embeddings. |
| app/services/rag_service.py | Retrieval, rerank, diversity, mode prompts, FLARE branch, Ollama chat. |
| app/utils/chunker.py | RecursiveCharacterTextSplitter; section heuristics in metadata. |
| app/utils/ollama_client.py | Retry-wrapped HTTP to Ollama /api/chat and /api/embeddings. |
| app/models/ | Pydantic request/response models shared by routers. |
| data/sample_docs/ | Bundled UTF-8 corpus (see §14). |
| tests/ | API and unit tests; tests/conftest.py uses dependency overrides and fake embedding/RAG for isolation. |
| evaluation/ | Optional regression fixtures for pipeline shape. |
| scripts/ | Corpus generators, portfolio PDF, arXiv bulk helpers. |
| web/ | Next.js client for the API. |
| Dockerfile / docker-compose.yml | Container image (Python 3.11-slim, non-root) and Compose stack with Chroma volume + healthcheck. |
flowchart TB
subgraph clients [Clients]
N[Next.js]
S[Streamlit]
end
subgraph api [DocuMind API]
F[FastAPI]
L[Lifespan: services + seed]
end
subgraph svc [Services]
D[DocumentService]
E[ChromaEmbeddingService]
R[RAGService]
end
subgraph ext [External]
O[Ollama]
C[(ChromaDB)]
end
N --> F
S --> F
F --> L
L --> D
L --> E
L --> R
D --> E
R --> E
R --> O
E --> O
E --> C
Lifespan (app/main.py): On startup, the app constructs OllamaClient, one shared chromadb.PersistentClient on CHROMA_PERSIST_DIR, then an EmbeddingRegistry with two ChromaEmbeddingService wrappers (papers + public collections) and two RAGService instances (one per content_library). Sharing one client avoids double-opening the same SQLite store.

When SEED_SAMPLE_DOCS=true (off by default; legacy DS demo path), startup runs seed_sample_docs against the papers collection only. It compares the SAMPLE_CORPUS_VERSION marker on disk to settings; on mismatch, it deletes sample_* vectors in that collection, rewrites the marker, then ingests each data/sample_docs/*.txt as sample_<stem>.

For Wikipedia-first deployments, keep SEED_SAMPLE_DOCS=false and grow CHROMA_COLLECTION_PUBLIC via scripts/bulk_index_public.py / scripts/build_public_corpus.py (or POST /api/v1/ingest with library=public).
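The version-marker check described above can be sketched as follows. This is a minimal illustration, not the repository's seed_sample_docs: the function name, the marker file layout, and the two callbacks (delete_prefix, ingest) are hypothetical stand-ins for the embedding service, and it assumes ingestion runs only when the marker differs.

```python
from pathlib import Path
from typing import Callable, List


def seed_if_version_changed(
    sample_dir: Path,
    marker_path: Path,
    wanted_version: str,
    delete_prefix: Callable[[str], None],
    ingest: Callable[[str, str], None],
) -> List[str]:
    """Re-seed sample docs only when the on-disk marker differs from settings.

    delete_prefix and ingest are hypothetical callbacks, not the real service API.
    Returns the doc_ids that were ingested (empty when the marker already matches).
    """
    current = ""
    if marker_path.exists():
        current = marker_path.read_text(encoding="utf-8").strip()
    if current == wanted_version:
        return []

    # Mismatch: purge stale sample_* vectors, then record the new version.
    delete_prefix("sample_")
    marker_path.write_text(wanted_version, encoding="utf-8")

    ingested: List[str] = []
    for txt in sorted(sample_dir.glob("*.txt")):
        doc_id = f"sample_{txt.stem}"  # e.g. intro.txt -> sample_intro
        ingest(doc_id, txt.read_text(encoding="utf-8"))
        ingested.append(doc_id)
    return ingested
```

Because the marker is rewritten before ingestion, a second start with the same SAMPLE_CORPUS_VERSION is a no-op in this sketch; bumping the version forces a purge and re-seed.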
Routers mount under /api/v1 except health routes at root.
- Input: POST /api/v1/ingest (multipart/form-data: file + optional library field, default public) or POST /api/v1/fetch-arxiv (JSON arxiv_id; always indexes papers).
- Validation: File size cap MAX_FILE_SIZE_MB; MIME/type checks in ingest router / document service.
- Extraction: PyPDF2 for PDF, python-docx for DOCX, raw decode for .txt.
- Metadata: Heuristic title, authors, year, optional arXiv id from leading text when parseable.
- Chunking: DocumentChunker uses LangChain RecursiveCharacterTextSplitter with CHUNK_SIZE and CHUNK_OVERLAP. Each langchain_core.documents.Document carries metadata: doc_id, filename, section (heuristic), chunk_index, page_number when known, etc.
- Indexing: ChromaEmbeddingService.add_documents (HTTP path) or add_indexed_batch (bulk indexer) embeds chunks via Ollama EMBEDDING_MODEL, writes to the selected collection with stable ids {doc_id}_{i}. Each chunk metadata is stamped with embedding_model, chroma_collection, and indexed_at (UTC) for re-embed and drift workflows.
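As an illustration of the stable id scheme and the metadata stamp from the indexing step, here is a small sketch. The helper name build_chunk_records and the ISO-8601 timestamp format are assumptions; only the id pattern {doc_id}_{i} and the field names come from the description above.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def build_chunk_records(
    doc_id: str,
    chunks: List[str],
    embedding_model: str,
    collection: str,
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Return (ids, metadatas) for one document's chunks.

    Hypothetical helper: the real service may stamp more fields.
    """
    # One timestamp per batch so all chunks of a document agree.
    indexed_at = datetime.now(timezone.utc).isoformat()
    ids: List[str] = []
    metadatas: List[Dict[str, object]] = []
    for i, _text in enumerate(chunks):
        ids.append(f"{doc_id}_{i}")  # stable id: re-indexing overwrites the same rows
        metadatas.append(
            {
                "doc_id": doc_id,
                "chunk_index": i,
                "embedding_model": embedding_model,
                "chroma_collection": collection,
                "indexed_at": indexed_at,
            }
        )
    return ids, metadatas
```

Stamping embedding_model on every chunk is what makes a later "find everything embedded with the old model" query possible when the model tag changes.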
DELETE /api/v1/papers/{doc_id} and DELETE /api/v1/ingest/{doc_id} call embedding_service.delete_document. If no chunks exist for that doc_id, the service returns false and the API responds 404 — empty delete is not silently successful.
Each Chroma collection is created with metadata={"hnsw:space": "cosine"}. Query results expose distance per hit; the RAG layer sorts ascending (lower distance = closer match) and keeps rows under the library-specific distance cutoff (RELEVANCE_THRESHOLD vs PUBLIC_RELEVANCE_THRESHOLD) before optional fallback.
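For intuition, cosine distance is one minus cosine similarity, so identical directions score 0 and opposite directions score 2. The sketch below shows that definition and the ascending sort plus cutoff; the helper names are illustrative and the row shape is assumed to be a dict with a distance key.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def keep_close_hits(rows: List[Dict], cutoff: float) -> List[Dict]:
    """Sort ascending by distance and keep rows strictly under the cutoff."""
    ordered = sorted(rows, key=lambda r: r["distance"])
    return [r for r in ordered if r["distance"] < cutoff]
```

A lower cutoff is stricter: with a cutoff of 0.5, a hit at distance 0.62 is dropped even if it is the best available match, which is the case the optional fallback exists for.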
section_filter is applied as a Chroma where clause on the section metadata field. Papers (PDFs / arXiv): values mirror paper-heading heuristics (abstract, methodology, …). Public (Wikipedia-style .txt): most chunks are body, so restrictive filters often return no hits; omit section_filter for broad public queries unless you control chunk metadata.
All logic below is implemented in app/services/rag_service.py unless noted.
For a user top_k and query_mode, the service expands the vector search n_results before reranking. Papers library: up to 64 candidates for general / compare (roughly 4× top_k, capped). Public (encyclopedia) library: wider pool (cap 96, multiplier 5× top_k for general/compare) to improve recall on long prose. Other modes use a slightly smaller cap for public.
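The pool expansion for general / compare can be written as a capped multiple of top_k. The sketch below encodes only the two documented pairs (4× capped at 64 for papers, 5× capped at 96 for public); the function name is hypothetical, the exact rounding in rag_service.py may differ, and caps for other modes are left to the caller because the text does not state them.

```python
from typing import Dict, Optional, Tuple

# Documented (multiplier, cap) pairs for general / compare only.
DOCUMENTED_POOLS: Dict[str, Tuple[int, int]] = {
    "papers": (4, 64),
    "public": (5, 96),
}


def candidate_pool_size(
    top_k: int,
    library: str,
    multiplier: Optional[int] = None,
    cap: Optional[int] = None,
) -> int:
    """Return n_results for the vector search before reranking.

    Pass multiplier / cap explicitly for modes other than general / compare.
    """
    default_multiplier, default_cap = DOCUMENTED_POOLS[library]
    m = default_multiplier if multiplier is None else multiplier
    c = default_cap if cap is None else cap
    return min(c, m * top_k)
```

For example, top_k=5 against papers asks Chroma for 20 candidates, while top_k=30 against public is capped at 96.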
- embedding_service.search(embed_query, retrieve_k, section_filter) returns rows {content, metadata, distance}.
- Keyword rerank: Rows are sorted by distance − W × keyword_overlap_score(rerank_query, content), where W is KEYWORD_RERANK_WEIGHT (papers) or PUBLIC_KEYWORD_RERANK_WEIGHT (public).
- Threshold filter: Keep rows with distance < RELEVANCE_THRESHOLD (papers) or PUBLIC_RELEVANCE_THRESHOLD (public). Compare mode adds a small slack bump (see rag_service.py).
- Fallback: If nothing passes and ENABLE_FALLBACK_RETRIEVAL is true, take the top FALLBACK_TOP_N by rerank order and mark internally (answer may append a disclosure line).
- Diversity: _select_diverse_sources prefers at most one strong chunk per doc_id before filling remaining slots, reducing single-document context monopolization.
- Context slot cap: Depends on query_mode (e.g. up to 24 chunks for general / compare).
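The rerank, threshold, fallback, and diversity steps above can be sketched end to end. This is an illustration under assumptions, not the code in rag_service.py: the keyword_overlap_score body (fraction of distinct query terms found in the chunk) is a guess at one reasonable scoring, select_diverse_sources is a simplified stand-in for _select_diverse_sources, and the compare-mode slack is omitted.

```python
import re
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]  # {"content": str, "metadata": {"doc_id": ...}, "distance": float}


def keyword_overlap_score(query: str, content: str) -> float:
    """Fraction of distinct query terms present in the chunk (assumed scoring)."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not terms:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", content.lower()))
    return len(terms & words) / len(terms)


def rerank_and_filter(
    rows: List[Row],
    query: str,
    weight: float,
    threshold: float,
    enable_fallback: bool,
    fallback_top_n: int,
) -> Tuple[List[Row], bool]:
    """Return (rows, used_fallback)."""
    # Lower is better: lexical overlap subtracts from the vector distance.
    ranked = sorted(
        rows,
        key=lambda r: r["distance"] - weight * keyword_overlap_score(query, r["content"]),
    )
    # The cutoff applies to the raw distance, not the adjusted sort key.
    kept = [r for r in ranked if r["distance"] < threshold]
    if kept or not enable_fallback:
        return kept, False
    return ranked[:fallback_top_n], True


def select_diverse_sources(rows: List[Row], max_slots: int) -> List[Row]:
    """Take the best-ranked chunk per doc_id first, then fill remaining slots."""
    first_per_doc: List[int] = []
    seen = set()
    for i, row in enumerate(rows):
        doc_id = row["metadata"]["doc_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            first_per_doc.append(i)
    chosen = set(first_per_doc)
    rest = [i for i in range(len(rows)) if i not in chosen]
    return [rows[i] for i in (first_per_doc + rest)[:max_slots]]
```

The used_fallback flag is what lets the answer layer append a disclosure line when no chunk met the cutoff.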
- datasets mode: Does not call the LLM for the main body. It scans retrieved chunk text for known dataset hints and patterns, emits a structured Markdown inventory. FLARE is skipped.
- Other modes: Builds a single context block from selected chunks, applies the mode’s system prompt (SYSTEM_PROMPTS for papers, PUBLIC_SYSTEM_PROMPTS for public), calls OllamaClient.chat with mode-dependent temperature, returns Markdown answer plus SourceCitation list.
- Confidence: Derived from mean chunk distance (clamped); exposed as a scalar for UI.
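One plausible reading of "mean chunk distance (clamped)" is sketched below. The mapping 1 − mean and the [0, 1] clamp range are assumptions for illustration; the actual formula lives in rag_service.py and may scale differently.

```python
from typing import Sequence


def confidence_from_distances(distances: Sequence[float]) -> float:
    """Map mean cosine distance to a [0, 1] scalar (assumed mapping: 1 - mean)."""
    if not distances:
        return 0.0  # no sources, no confidence
    mean = sum(distances) / len(distances)
    return max(0.0, min(1.0, 1.0 - mean))
```

With this mapping, sources at distances 0.2 and 0.4 give roughly 0.7, and anything averaging at or above distance 1.0 bottoms out at 0.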
| query_mode | Papers library (library=papers) | Public library (library=public) |
|---|---|---|
| general | Broad grounded synthesis over research excerpts. | Same shape; prompts use Article title and encyclopedia-neutral tone. |
| compare | Cross-paper comparison; table-oriented. | Cross-article comparison; same table pattern with article titles. |
| methodology | Implementation-focused extraction from papers. | Process / mechanism extraction from encyclopedia prose. |
| datasets | Deterministic dataset / benchmark hints from chunk text. | Same scanner; useful when excerpts name corpora (e.g. “Wikipedia”). |
| reproduce | Reproducibility checklist for experiments. | “What could be reproduced from excerpts” framing. |
Optional section_filter — see §5.4.
Default: retrieval_strategy=baseline — single dense-vector pass, keyword rerank, distance threshold, source diversity, then generation.
Request fields: retrieval_strategy (baseline | flare | hyde | multi_query), legacy use_flare (maps to flare when strategy is baseline), and retrieve_only (skip final answer LLM — for benchmarks).
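The legacy use_flare mapping amounts to a small normalization step. The sketch below is illustrative: resolve_strategy is a hypothetical name, and raising ValueError on unknown values is an assumption (the real API validates through its Pydantic request model).

```python
VALID_STRATEGIES = ("baseline", "flare", "hyde", "multi_query")


def resolve_strategy(retrieval_strategy: str = "baseline", use_flare: bool = False) -> str:
    """Return the effective strategy for a request."""
    if retrieval_strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown retrieval_strategy: {retrieval_strategy!r}")
    # Legacy flag only upgrades the default; an explicit strategy always wins.
    if retrieval_strategy == "baseline" and use_flare:
        return "flare"
    return retrieval_strategy
```

So an older client sending only use_flare=true still gets the FLARE-shaped second pass, while a request with retrieval_strategy=hyde and use_flare=true runs HyDE.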
| Strategy | Mechanism | Extra LLM calls |
|---|---|---|
| baseline | Embed user query → Chroma → rerank → filter | 0 |
| flare | Same first pass; forward-looking draft; second search if draft has ??? or hedges | 0–1 draft |
| hyde | LLM writes hypothetical passage; embed that for search (Gao et al., HyDE) | 1 |
| multi_query | LLM emits up to 3 sub-queries; search each; RRF fuse ranks | 1 |
datasets mode always uses baseline extraction (no strategy LLM helpers).
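For the multi_query row, reciprocal rank fusion (RRF) scores each chunk by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below uses k = 60, the constant conventionally used with RRF; whether this repository uses the same constant is an assumption, and the function name and id-only input shape are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

RRF only needs ranks, not comparable distances, which is why it suits fusing results from sub-queries whose distance scales differ.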
Full FLARE (Jiang et al., arXiv:2305.06983) triggers retrieval from token-level confidence during generation. Ollama chat here does not expose logprobs, so flare uses a short draft with explicit ??? / “not stated in excerpt” hedges and flare_triggers_follow_up() to gate a second pass.
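The draft-based gate can be approximated with a marker scan. The function name flare_triggers_follow_up comes from the text above, but its signature and the marker list here are assumptions; the real heuristic in rag_service.py may check more phrases.

```python
# Assumed hedge markers; the repository's list may be longer.
HEDGE_MARKERS = ("???", "not stated in excerpt")


def flare_triggers_follow_up(draft: str) -> bool:
    """Return True when the draft signals missing evidence and a second search is worthwhile."""
    lowered = draft.lower()
    return any(marker in lowered for marker in HEDGE_MARKERS)
```

Gating on explicit markers keeps the extra retrieval bounded to at most one follow-up pass and makes the trigger easy to unit test without a model.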
# API + Ollama up, public corpus indexed
.\.venv\Scripts\python scripts\run_retrieval_ablation.py --base-url http://127.0.0.1:8001 `
--report-md evaluation/reports/retrieval_ablation.md `
--csv evaluation/reports/retrieval_ablation.csv `
--json-out evaluation/reports/retrieval_ablation_summary.json

Benchmark cases: evaluation/retrieval_ablation.json (8 probes) or --bench evaluation/wiki_public_bench.json. The script prints per-strategy grounded rate, latency p50/p95, avg unique docs, and Jaccard overlap vs baseline for slide-ready Markdown.
Stakeholder visuals: Offline scores → python scripts/run_retrieval_ablation_offline.py then python scripts/refresh_stakeholder_views.py → open evaluation/reports/stakeholder_dashboard.html. In Cursor, open canvas documind-retrieval-scores (charts + per-query scorecard).
| Approach | Why not default in this repo |
|---|---|
| Token-level FLARE | Needs logprobs from the generator host |
| Self-RAG / CRAG | Heavy orchestration for a local demo |
| Cross-encoder rerank | Strong in prod; kept Ollama-only for clone-friendly deploys |
| Method | Path | Body / params | Notes |
|---|---|---|---|
| GET | /health | none | Ollama availability + stats for DEFAULT_LIBRARY collection. |
| GET | /health/live | none | Process liveness. |
| GET | /health/ready | none | 503 if dependencies not ready. |
| GET | /api/v1/libraries | none | Both collections’ CollectionStats + default_library (ops / capacity). |
| GET | /api/v1/diagnostics | none | Operator snapshot: API version, uptime, Python, active retrieval thresholds/weights, chunk defaults, both index counts, DOCMIND_GIT_SHA if set. |
| POST | /api/v1/ingest | multipart/form-data: file, optional library | Indexes into public or papers. |
| DELETE | /api/v1/ingest/{doc_id} | Query ?library= (default public) | 404 if no chunks. |
| POST | /api/v1/fetch-arxiv | { "arxiv_id": "..." } | Downloads PDF; indexes papers only. |
| POST | /api/v1/query | QueryRequest JSON (library default public) | See app/models/request_models.py. |
| GET | /api/v1/papers | Query ?library= | Library cards. |
| GET | /api/v1/papers/{doc_id} | Query ?library= | One document. |
| DELETE | /api/v1/papers/{doc_id} | Query ?library= | 404 if no chunks. |
| GET | /api/v1/collection/stats | Query ?library= | Aggregate counts for one collection. |
OpenAPI: /docs, /redoc, /openapi.json unless DISABLE_OPENAPI=true.
Authentication: When API_KEY is non-empty, all /api/v1/* routes (except CORS preflight) require header X-API-Key matching the setting; mismatch → 401.
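The gate logic reads as a pure decision function. The sketch below is an assumption-laden illustration: api_key_status is a hypothetical name, headers are modeled as a plain dict (real HTTP headers are case-insensitive), any OPTIONS request is treated as a preflight, and the constant-time comparison via hmac.compare_digest is a suggested practice rather than a statement about the repository's code.

```python
import hmac
from typing import Mapping, Optional


def api_key_status(
    method: str,
    path: str,
    headers: Mapping[str, str],
    configured_key: str,
) -> Optional[int]:
    """Return 401 when the request must be rejected, otherwise None."""
    if not configured_key:
        return None  # empty API_KEY disables the gate
    if not path.startswith("/api/v1/"):
        return None  # health and docs routes are outside the gate
    if method.upper() == "OPTIONS":
        return None  # simplified CORS preflight exemption
    supplied = headers.get("X-API-Key", "")
    # Constant-time comparison avoids leaking key prefixes through timing.
    if hmac.compare_digest(supplied.encode("utf-8"), configured_key.encode("utf-8")):
        return None
    return 401
```

A client therefore only needs to add one header, X-API-Key, to every /api/v1/* call when the key is configured.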
All keys are listed in .env.example. Grouped reference:
| Group | Variables | Purpose |
|---|---|---|
| Models | OLLAMA_BASE_URL, LLM_MODEL, EMBEDDING_MODEL, OLLAMA_REQUEST_TIMEOUT_SEC | Inference endpoints, model tags, HTTP timeout for chat/embed (bulk-friendly). |
| Vector store | CHROMA_PERSIST_DIR, CHROMA_COLLECTION_NAME (papers), CHROMA_COLLECTION_PUBLIC, DEFAULT_LIBRARY | On-disk path; collection names; default library for /health stats. |
| Chunking | CHUNK_SIZE, CHUNK_OVERLAP | Text splitter parameters; affects chunk count and context granularity. |
| Retrieval defaults | TOP_K_RESULTS, RELEVANCE_THRESHOLD, PUBLIC_RELEVANCE_THRESHOLD, PUBLIC_KEYWORD_RERANK_WEIGHT, ENABLE_FALLBACK_RETRIEVAL, FALLBACK_TOP_N, KEYWORD_RERANK_WEIGHT | Papers vs public distance/lexical tuning; per-request top_k overrides for query. |
| Ingest | MAX_FILE_SIZE_MB, ARXIV_BASE_URL | Upload cap and arXiv PDF export host. |
| Sample corpus (legacy) | SAMPLE_CORPUS_VERSION, SEED_SAMPLE_DOCS | When SEED_SAMPLE_DOCS=true: bump version to purge/re-seed sample_* in papers only. Default false (Wikipedia-first). |
| Network | CORS_ORIGINS, CORS_ALLOW_ALL, TRUSTED_HOSTS | Browser and Host-header policy. |
| App | APP_ENV, DISABLE_OPENAPI | Environment label; docs toggle. |
| Security / transport | API_KEY, ENABLE_RESPONSE_GZIP | Optional API key gate; gzip responses. |
| Logging | LOG_LEVEL, LOG_JSON | Verbosity and JSON log lines. |
| FLARE | FLARE_ACTIVE_RETRIEVAL, FLARE_DRAFT_MAX_CONTEXT_CHARS | Global FLARE default and draft context budget. |
Applied in app/main.py (order matters for FastAPI / Starlette):
- CORS: CORSMiddleware with explicit origins or wildcard when CORS_ALLOW_ALL (dev-only).
- Trusted hosts: Optional TrustedHostMiddleware when TRUSTED_HOSTS is set.
- Gzip: GZipMiddleware when ENABLE_RESPONSE_GZIP and payload exceeds minimum size.
- Per-request: X-Request-ID assignment, optional API key gate, default security headers (X-Content-Type-Options, X-Frame-Options, Referrer-Policy; Permissions-Policy in production APP_ENV).
- Errors: HTTPException and RequestValidationError return structured JSON; uncaught exceptions return 500 with request_id in body.
- Request correlation: Every response carries X-Request-ID; access logs include request_id, method, path, status, duration_ms.
- Structured logs: LOG_JSON=true for log platforms.
- Healthchecks: Docker Compose defines an HTTP probe against /health/live (see docker-compose.yml). Prefer /health/ready for LB routing when Ollama and Chroma must be live.
- Chroma persist corruption (development): If opening the store raises a recoverable Chroma/Rust error (APP_ENV=development), the API renames CHROMA_PERSIST_DIR to a sibling *.broken.<UTC> folder, then exits startup with RuntimeError. Restart the process once so a fresh Python interpreter opens the new empty directory (PyO3 panics can poison in-process bindings; an immediate re-open in the same process is unsafe). Production/staging surfaces the error without renaming.
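The development-only quarantine path can be sketched with pathlib. Everything named here is hypothetical (quarantine_store, open_or_quarantine, the open_store callback, and the timestamp format); the broad except stands in for whatever specific Chroma/Rust errors the real code treats as recoverable.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable


def quarantine_store(persist_dir: Path) -> Path:
    """Rename the store to a sibling <name>.broken.<UTC> folder and return the new path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  # assumed format
    target = persist_dir.with_name(f"{persist_dir.name}.broken.{stamp}")
    persist_dir.rename(target)
    return target


def open_or_quarantine(
    persist_dir: Path,
    open_store: Callable[[Path], Any],
    app_env: str,
) -> Any:
    """Open the store; in development, move a bad store aside and abort startup."""
    try:
        return open_store(persist_dir)
    except Exception as exc:  # the real code narrows this to recoverable errors
        if app_env != "development":
            raise  # production/staging surfaces the original error untouched
        moved = quarantine_store(persist_dir)
        # Do not re-open in this process; a fresh interpreter must do it.
        raise RuntimeError(f"Chroma store moved to {moved}; restart the process") from exc
```

Raising instead of retrying matches the restart requirement above: the next process start finds no directory at CHROMA_PERSIST_DIR and creates an empty store.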
| Target | Command / notes |
|---|---|
| Docker Compose | docker compose up --build — publishes 8001, mounts Chroma volume chroma_data, read-only ./data. Set OLLAMA_BASE_URL to reachable Ollama (default host.docker.internal:11434 on Docker Desktop). |
| Bare metal / VM | uvicorn app.main:app --host 0.0.0.0 --port 8001 (add --proxy-headers behind TLS terminator per your platform). |
| Windows dev | .\start_documind.ps1 (Ollama, API, Next); uses .venv\Scripts\python.exe when present. First boot can sit in corpus ingest for a long time before /health responds; the script waits up to 180 minutes (-MaxApiWaitMinutes). -SkipModelPull speeds repeat boots. .\stop_documind.ps1 clears ports 3002, 8001, 11434 — confirm Ollama shutdown is intended. |
Backup: Copy CHROMA_PERSIST_DIR regularly; it is the authoritative index. Source PDFs/DOCX should remain in object storage or VCS-independent archives if they are not all under data/.
| What | Where | Count / note |
|---|---|---|
| data/sample_docs/*.txt | Git | 463 files total (400 synthetic sample_corpus_p7_*.txt, 63 other curated/hand files). See data/sample_docs/README.md. |
| Chunk rows for that bundle | Chroma papers only if SEED_SAMPLE_DOCS=true | Not in git; derived at index time from CHUNK_SIZE / overlap and text length (order of magnitude: low thousands to ~10k+ for the full bundle). |
| CI / pytest “corpus” | tests/ranking_fake_embedding.py | 6 synthetic doc_ids / 6 chunks; deterministic regression only. |
| Large public text | Disk + CHROMA_COLLECTION_PUBLIC | HF streaming scripts; bulk embed with checkpoint; respect CC BY-SA if you redistribute. |
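Since chunk counts are derived from CHUNK_SIZE, CHUNK_OVERLAP, and text length, a rough per-document estimate is possible with fixed-stride arithmetic. This is only an approximation under a sliding-window assumption: RecursiveCharacterTextSplitter breaks on separators, so real counts will differ, and estimate_chunks is a hypothetical helper, not the bulk indexer's --dry-run logic.

```python
import math


def estimate_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate chunk count for a text of text_len characters (fixed-stride model)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    # Each chunk after the first advances by (size - overlap) characters.
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((text_len - chunk_size) / stride)
```

Summing this over every file in a corpus gives a quick order-of-magnitude check before committing to a long embedding run.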
Strategic default: library=public and offline bulk index are the flagship path; the DS sample_docs bundle is legacy / optional demo for the papers collection.
- data/sample_docs/: See table above and folder README. Optional demo material for papers when SEED_SAMPLE_DOCS=true (discouraged for Wikipedia-first deployments).
- Chroma in the repo clone: Often tens of MB after local indexing; size grows with chunk count × (vectors + stored text + HNSW). Empty public collection adds negligible disk until you bulk-index.

- pip install datasets (for Hugging Face streaming).
- One command (stream + bulk index): python scripts/build_public_corpus.py --articles 10000. Use --articles 50000 or higher for serious scale; --articles 0 --allow-unbounded streams the full dump (disk-hungry).
- Piecemeal: scripts/stream_wikipedia_to_txt.py → scripts/bulk_index_public.py (--dry-run for chunk estimates, --checkpoint for resume, --workers for parallel Ollama embeds).
- Ops: GET /api/v1/libraries for both collections’ chunk and document counts.
- Scale narrative + disk report: docs/SCALE_OPERATIONS_PLAYBOOK.md; python scripts/report_corpus_scale.py (needs running API).
- Public wiki RAG bench: evaluation/wiki_public_bench.json (twenty grounded probes); run python scripts/run_wiki_public_bench.py --base-url http://127.0.0.1:8001 (tune PUBLIC_RELEVANCE_THRESHOLD / PUBLIC_KEYWORD_RERANK_WEIGHT in .env if recall is thin on a small slice).
- Regeneration (legacy papers bundle): python scripts/generate_production_corpus.py --count 500 --force then bump SAMPLE_CORPUS_VERSION (only with SEED_SAMPLE_DOCS=true).
- Hand-authored expansion: scripts/materialize_institutional_corpus.py.
- arXiv bulk: scripts/bulk_ingest_arxiv.py + data/arxiv_seed_list.txt (indexes papers).
pytest -q

tests/conftest.py overrides FastAPI dependencies with fake embedding/RAG services so unit tests do not require Ollama or Chroma.
Query regression suite: tests/test_rag_query_suite.py runs 20 parameterized cases (tests/query_eval_cases.py) against real RAGService with a ranking fake vector layer (tests/ranking_fake_embedding.py) and a deterministic Ollama stub — metrics cover status, has_answer, source counts, answer substrings, and wall time. Live library smoke: python scripts/run_query_eval.py --base-url http://127.0.0.1:8001 (optional --csv report.csv; skips empty-corpus cases with --skip-empty-corpus-cases).
CI: On push and pull request to main or master, GitHub Actions (.github/workflows/ci.yml) runs on Python 3.11 and 3.12: ruff check (syntax / undefined-name rules), then pytest. No Ollama or Chroma in CI. Pytest and Ruff defaults: pyproject.toml. Dependabot for Actions: .github/dependabot.yml.
Not implemented in this repository (non-exhaustive):
- Per-user or per-tenant ACL on chunks or documents.
- SSO / OIDC for the API or UI.
- OCR pipeline for low-quality scanned PDFs beyond basic text extraction.
- Hosted managed vector SaaS swap (Pinecone, Weaviate, etc.): would replace ChromaEmbeddingService while preserving router contracts.
- Token-level FLARE: requires a host that exposes logprobs or an alternative uncertainty model.
- Chroma auto-quarantine: development-only; requires one manual restart after a bad on-disk store is moved aside (see §12).
Natural extensions: swap Ollama for OpenAI/Azure OpenAI behind the same RAGService boundary; add golden-set eval CI; wire /health/ready to load balancers; add cross-encoder reranking as an optional second stage.
Under portfolio/: client project catalog HTML, portfolio brief HTML, optional PDF generation (scripts/portfolio_requirements.txt, scripts/generate_portfolio_pdf.py), dashboard screenshot portfolio/screenshots/documind-dashboard.png.
Regenerating the screenshot (recommended): A bare playwright screenshot of the home page misses indexed doc counts and synthesis. Use the bundled driver after API + Next are up and sample ingest has progressed:
.\.venv\Scripts\pip install -r scripts\screenshot_requirements.txt
.\.venv\Scripts\playwright install chromium
.\scripts\capture_dashboard.ps1 # waits for /health/live, baseline public scenario, synthesis text, tall-viewport PNG
# Smaller corpus / faster index gate: .\scripts\capture_dashboard.ps1 -MinDocs 5
# Custom API / wait cap: .\scripts\capture_dashboard.ps1 -ApiBase "http://127.0.0.1:8001" -MaxLivenessWaitMinutes 240

Or directly: .\.venv\Scripts\python scripts\capture_dashboard_playwright.py --help. The script waits for ≥120 chars in .prose-answer, scrolls synthesis into view, writes 1680×3200 portfolio/screenshots/documind-dashboard.png (default --viewport-width 1680; use --viewport-width 1440 if needed), then 1000×750 portfolio/screenshots/documind-upwork-catalog-1000x750.png (default stack infographic tile; --plain-catalog-thumb top-crops the dashboard). Thumb only: .\.venv\Scripts\python scripts\capture_dashboard_playwright.py --thumb-only. Standalone tile: python scripts/catalog_thumb_art.py --out portfolio/screenshots/documind-upwork-catalog-1000x750.png. Avoid --full-page for portfolio assets.
- Jiang et al., Active Retrieval Augmented Generation (FLARE), arXiv:2305.06983.
- Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), arXiv:2212.10496.
- FastAPI, Pydantic v2, ChromaDB, LangChain text splitters, Ollama HTTP API.
For staged migration (compute split, managed inference, job-based ingest, vector tier, rate limits), use playbook §9.
- Dual corpora with explicit library routing; shared Chroma client, two collections; grounded answers with source list and distances.
- Health: /health, /health/live, /health/ready; ops: /api/v1/libraries, /api/v1/diagnostics.
- Retrieval: mode-specific budgets, keyword rerank, distance cutoffs (separate for public vs papers), optional second retrieval pass (FLARE-shaped), structured datasets extraction path.
- Tests: pytest with fakes for CI; optional live checks via scripts/run_query_eval.py.
- Per-tenant ACLs, SSO, billing, multi-region HA, managed vector SaaS (swap behind the same service boundaries is a separate project).
- Full LLM+embed golden tests in CI (cost); run against a live stack when needed.
- One PersistentClient, two collections: avoids double-opening the same SQLite path.
- Chroma disk failures: dev quarantine + process restart path; production needs backup/restore runbooks.
- FLARE without logprobs: follow-up retrieval is gated on draft heuristics (rag_service), bounded and testable.
- CI RAG tests: deterministic chat stub and ranking-aware fake embeddings in tests/test_rag_query_suite.py.
See §8.1: HyDE, multi-query fusion, cross-encoder rerank-only, self-RAG/CRAG vs the bounded second-pass path used here.
Cross-encoder rerank (feature-flagged), OpenTelemetry on retrieve vs generate, frozen eval JSONL per release, SLO table (p95 / error budget). Same playbook §9 for cloud hardening sequence.
Python 3.11+ (Dockerfile pins 3.11-slim), FastAPI, Uvicorn, Pydantic Settings, ChromaDB, langchain_core + langchain-text-splitters, Ollama, Next.js 15, React 18, TypeScript, pytest, optional Streamlit.