llm-rag-assistant is a fully local, retrieval-augmented chatbot powered by llama-cpp-python. It answers questions in Spanish using your own Q&A dataset: FAISS + multilingual sentence-transformers retrieve relevant context, and a local instruction-tuned LLM (default: Gemma 3 1B Instruct, GGUF) generates the response.
Looking for the Spanish version? See [README_es.md](README_es.md).
| Component | Version |
|---|---|
| Python | 3.10.14 |
| llama-cpp-python | 0.3.2 |
| faiss-cpu | 1.7.4 |
| sentence-transformers | 2.7.0 |
| torch (CPU) | 2.2.0 |
| transformers | 4.36.2 |
| accelerate | 0.26.0 |
| scikit-learn | 1.4.2 |
| numpy | 1.26.4 |
| scipy | 1.11.4 |
| bert-score | 0.3.13 |
| rouge-score | 0.1.2 |
| nltk | 3.8.1 |
All Python dependencies are pinned in `requirements.txt` for reproducibility.
With GPU support (optional):

```bash
pip install faiss-gpu-cu11   # or faiss-gpu-cu12 for CUDA 12
```
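If a CUDA build of FAISS is installed, an index created on CPU can be moved to the GPU at load time. A minimal sketch, assuming the `dataset_index.faiss` file produced later by `prepare_embeddings.py`:

```python
# Minimal sketch: move a CPU-built FAISS index to the first GPU.
# Assumes a CUDA build of FAISS (faiss-gpu-cu11 or faiss-gpu-cu12).
import faiss

index = faiss.read_index("dataset_index.faiss")  # built by prepare_embeddings.py

if hasattr(faiss, "get_num_gpus") and faiss.get_num_gpus() > 0:
    res = faiss.StandardGpuResources()             # GPU scratch memory
    index = faiss.index_cpu_to_gpu(res, 0, index)  # device 0
```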
- 🔍 Semantic search with multilingual sentence-transformers
- 🧠 Local LLM inference with llama-cpp-python (CPU friendly GGUF models)
- 💻 Runs on standard laptops/desktops — no GPU or CUDA required
- 🔒 100% offline, no API keys or external services
- 🗂️ Works with any JSON Q&A dataset
This repository ships a console-based RAG chatbot that runs entirely offline.
1. Python 3.9+

2. Install dependencies (recommend `llama-cpp-python >= 0.3.2` for Gemma 3 support):

   ```bash
   pip install "llama-cpp-python>=0.3.2" faiss-cpu sentence-transformers
   ```

   On macOS you can fall back to conda if compilation fails:

   ```bash
   conda install -c conda-forge llama-cpp-python
   pip install faiss-cpu sentence-transformers
   ```
3. Download a GGUF model and place it under `../models/`:

   - Gemma 3 1B Instruct (recommended):

     ```bash
     wget https://huggingface.co/google/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf \
       -O gemma-3-1b-it.Q4_K_M.gguf
     ```

   - Mistral-7B-Instruct:

     ```bash
     wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
       -O mistral-7b-instruct.Q4_K_M.gguf
     ```

   ℹ️ Hugging Face may require you to sign in and accept the license before downloading. If you hit a 403 error, open the model page, accept the terms, and rerun the command.

   Always verify file integrity by comparing the `sha256` hash against the value published by the model provider:

   ```bash
   sha256 gemma-3-1b-it.Q4_K_M.gguf
   sha256 mistral-7b-instruct.Q4_K_M.gguf
   ```
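If the `sha256` binary is not available on your platform, the same check can be done with Python's standard library (the expected hash still comes from the model provider's page):

```python
# Cross-platform SHA-256 check using only the standard library.
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256sum("gemma-3-1b-it.Q4_K_M.gguf"))
print(sha256sum("mistral-7b-instruct.Q4_K_M.gguf"))
```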
Model licenses:

- Gemma 3: Gemma license → https://ai.google.dev/gemma
- Mistral 7B: Apache 2.0 → https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
Recommended steps before downloading Gemma:

- Sign in with your Hugging Face account and accept the license in the Files and versions tab of `google/gemma-3-1b-it-GGUF`.
- Log into the CLI: `huggingface-cli login` (or use an access token).
- Run the `wget` / `huggingface-cli download` commands above.
- Confirm `llama-cpp-python >= 0.3.2`; older releases throw "unknown model architecture: 'gemma3'" (see the quick check below).

Transformers backend (optional):

```bash
huggingface-cli download google/gemma-3-1b-it \
  --local-dir ../models/gemma-3-1b-it-transformers \
  --local-dir-use-symlinks False
```

Requires license acceptance and `huggingface-cli login`.
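A quick way to confirm the installed version from Python (assuming the package exposes `__version__`, as recent releases do):

```python
# Gemma 3 GGUF support needs llama-cpp-python >= 0.3.2; older builds
# fail with "unknown model architecture: 'gemma3'".
import llama_cpp
print(llama_cpp.__version__)
```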
4. Build your Q&A dataset in `qa_dataset.json`:

   ```json
   [
     {
       "pregunta": "¿Cuál es el horario de atención?",
       "respuesta": "Nuestro horario es de lunes a viernes de 9 a 18 y sábados de 9 a 14."
     },
     {
       "pregunta": "¿Cómo puedo contactar con soporte técnico?",
       "respuesta": "Puedes escribir a soporte@empresa.com o llamar al 900-123-456."
     }
   ]
   ```

5. Configure `config.yaml`:

   ```yaml
   models:
     embeddings:
       model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
     generation:
       llama_cpp_model_path: "../models/gemma-3-1b-it.Q4_K_M.gguf"
       max_tokens: 256
   ```
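For reference, a script can read these settings with PyYAML. A minimal sketch; the repo's actual loading code may differ:

```python
# Minimal sketch of reading config.yaml; requires PyYAML. The actual
# loading code in the repo's scripts may differ.
import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

embedding_model = config["models"]["embeddings"]["model_name"]
llm_path = config["models"]["generation"]["llama_cpp_model_path"]
max_tokens = config["models"]["generation"]["max_tokens"]
```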
Key files:

- `prepare_embeddings.py` → builds `dataset_index.faiss` and `qa.json` (see the sketch below)
- `chatbot_rag_local.py` → console chatbot using llama-cpp
- `chatbot_rag_local_transformers.py` → transformers-based alternative
- `qa_dataset.json` → user knowledge base (input)
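Conceptually, the indexing step boils down to embedding every stored question and writing a FAISS index. A simplified sketch; the actual `prepare_embeddings.py` may differ in details:

```python
# Simplified sketch of the indexing step; prepare_embeddings.py may differ.
import json
import faiss
from sentence_transformers import SentenceTransformer

with open("qa_dataset.json", encoding="utf-8") as f:
    qa_pairs = json.load(f)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(
    [item["pregunta"] for item in qa_pairs],
    convert_to_numpy=True,
    normalize_embeddings=True,  # unit vectors -> inner product == cosine similarity
)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)

faiss.write_index(index, "dataset_index.faiss")
with open("qa.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, ensure_ascii=False, indent=2)
```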
Then run:

```bash
python prepare_embeddings.py   # build the index
python chatbot_rag_local.py    # start the console chatbot
```

Chat with your knowledge base in Spanish :)
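Under the hood, each turn is retrieve-then-generate. A condensed sketch, assuming the index built above and the default Gemma GGUF path; the real `chatbot_rag_local.py` adds prompt templating and an interactive loop:

```python
# Condensed retrieve-then-generate turn; chatbot_rag_local.py wraps this
# in an interactive loop with its own prompt template.
import json
import faiss
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

index = faiss.read_index("dataset_index.faiss")
with open("qa.json", encoding="utf-8") as f:
    qa_pairs = json.load(f)

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
llm = Llama(model_path="../models/gemma-3-1b-it.Q4_K_M.gguf", n_ctx=2048, verbose=False)

question = "¿Cuál es el horario de atención?"
query = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)
_, ids = index.search(query, 3)  # indices of the 3 closest stored questions
context = "\n".join(qa_pairs[i]["respuesta"] for i in ids[0])

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": f"Usa este contexto para responder en español:\n{context}\n\nPregunta: {question}",
    }],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```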
Build the image from the repository root:

```bash
docker build -t llm-rag-assistant .
```

Run the console bot, mounting your local GGUF directory so `/app` can resolve `../models/...`:

```bash
docker run --rm -it \
  -v $(pwd)/../models:/models \
  llm-rag-assistant
```

With a prebuilt FAISS index and Q&A store:

```bash
docker run --rm -it \
  -v $(pwd)/dataset_index.faiss:/app/dataset_index.faiss \
  -v $(pwd)/qa.json:/app/qa.json \
  -v $(pwd)/../models:/models \
  llm-rag-assistant
```
Override the default command to run other workflows when needed:
```bash
docker run --rm -it \
-v $(pwd)/../models:/models \
llm-rag-assistant \
python model_evaluation.py bertscore
```
Alternative without llama.cpp (transformers)
-------------------------------------------
1. `pip install torch transformers accelerate`
2. Download Hugging Face weights to `../models/gemma-3-1b-it-transformers`
3. Update `config.yaml` → `transformers` section (path/device/dtype)
4. `python chatbot_rag_local_transformers.py`
> Needs ample RAM (12 GB+ recommended) or a GPU for smooth inference.
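For orientation, generation over the transformers backend looks roughly like the sketch below, assuming the weights were downloaded to the path from step 2; the actual `chatbot_rag_local_transformers.py` adds retrieval and prompt handling on top:

```python
# Rough sketch of the transformers generation path; the actual
# chatbot_rag_local_transformers.py adds retrieval and prompt handling.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "../models/gemma-3-1b-it-transformers"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)  # CPU by default

prompt = "¿Cuál es el horario de atención?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```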
Models available under `../models`
----------------------------------
| Model | Strengths | Considerations |
|-------|-----------|----------------|
| `gemma-3-1b-it.Q4_K_M.gguf` | ✅ 1B parameters, Q4_K_M quantization (~2.1 GB). Fast startup on CPU/MPS.<br>✅ Tuned for Spanish/science; low hallucination rate.<br>✅ Works great on Apple Silicon and AVX2-only CPUs. | ℹ️ Must accept Gemma license and use `llama-cpp-python` ≥0.3.2.<br>ℹ️ Smaller model may need more explicit prompts for long answers. |
| `mistral-7b-instruct.Q4_K_M.gguf` | ✅ 7B parameters, robust with generic prompts.<br>✅ Widely battle-tested. | ⚠️ ~4.1 GB on disk, slower on CPU.<br>⚠️ Higher RAM usage (~7–8 GB with long contexts). |
| `qwen2.5-1.5b-instruct-q2_k.gguf` | ✅ Ultra-light (<1 GB), ideal for tight hardware budgets.<br>✅ Good multilingual coverage. | ⚠️ Aggressive Q2 quantization → lower fidelity.<br>⚠️ Needs carefully structured prompts. |
**Recommendation**: Gemma 3 1B Instruct offers the best balance for this project—fast, accurate in Spanish, and resource-friendly. Keep Mistral as a backup if you need longer answers and have extra RAM.
Evaluation & Metrics
====================
Available scripts
-----------------
- `model_evaluation.py` → generates answers (default) or runs BERTScore (`python model_evaluation.py bertscore`)
- `calculate_metrics_from_json.py` → recomputes BERTScore, ROUGE, BLEU, cosine similarity from an existing JSON
- `real_rag_evaluation.py` → end-to-end evaluation of the live RAG pipeline
Sample workflow
---------------
```bash
# 1. Generate answers
python model_evaluation.py
# 2. Compute BERTScore on existing results
python model_evaluation.py bertscore
```
Outputs:
- `evaluation_results.json` – question/ground-truth/generated triples
- `bertscore_results.json` – BERTScore stats and per-sample metrics
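To recompute BERTScore from a saved results file yourself (what `calculate_metrics_from_json.py` automates), a minimal sketch; the JSON field names below are assumptions, so check them against your `evaluation_results.json`:

```python
# Recompute BERTScore from saved results. The field names "generated" and
# "ground_truth" are assumptions -- check your evaluation_results.json.
import json
from bert_score import score

with open("evaluation_results.json", encoding="utf-8") as f:
    results = json.load(f)

candidates = [r["generated"] for r in results]
references = [r["ground_truth"] for r in results]

P, R, F1 = score(candidates, references, lang="es")
print(f"F1: {F1.mean().item():.4f} ± {F1.std().item():.4f}")
```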
Metric interpretation
---------------------
- **BERTScore F1**
  - `> 0.85` → Excellent
  - `0.70–0.85` → Good
  - `0.50–0.70` → Needs improvement
  - `< 0.50` → Problematic
- **ROUGE-1/2/L** → Unigram/bigram/longest-sequence overlap
- **BLEU-4** → 4-gram precision (typical range 0.2–0.6 for natural text)
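For reference, here is how one reference/candidate pair can be scored with the pinned `rouge-score` and `nltk` libraries (the example sentences are illustrative):

```python
# Score one reference/candidate pair; the example strings are illustrative.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Nuestro horario es de lunes a viernes de 9 a 18."
candidate = "Atendemos de lunes a viernes de 9 a 18."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))  # precision/recall/fmeasure per variant

# BLEU-4 with smoothing so short sentences don't degenerate to 0
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu:.4f}")
```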
Configuration knobs (in the scripts)
------------------------------------
```python
n_samples = 15 # Number of Q&A pairs to evaluate
random.seed(42) # Reproducibility
lang = "es" # Language for BERTScore
```
Example output
--------------
```
📊 BERTScore (semantic similarity):
Precision: 0.7724 ± 0.0879
Recall: 0.8905 ± 0.0591
F1-Score: 0.8265 ± 0.0732
📝 ROUGE:
ROUGE-1: 0.5064 ± 0.2007
ROUGE-2: 0.4026 ± 0.2220
ROUGE-L: 0.4760 ± 0.2138
```
Hardware recommendations
------------------------
- Minimum 8 GB RAM (16 GB preferred)
- ~5 GB free disk space for models and indexes