```bash
# Quick RAG test on 10 examples
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps \
    --limit 10
```

```bash
# Full RAG evaluation
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps
```

```bash
# Baseline evaluation (no RAG) with lm-eval-harness
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --output_path ./results/ \
    --device mps
```

```bash
# RAG evaluation for comparison against the baseline
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps
```

```bash
# Launch the dashboard
cd dashboard
python server.py
```

The dashboard will show both models:
- `meta-llama__Llama-3.2-3B` - Baseline model
- `rag-meta-llama__Llama-3.2-3B` - RAG-enhanced model
```bash
# Default is 3, try 5 for more context
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps \
    --n_retrieval 5
```

```bash
# Try other MMLU subjects where Wikipedia might help
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_world_religions \
    --device mps
```

```bash
# Use a custom ChromaDB location
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps \
    --chroma_path /path/to/chroma_db
```

After each run, you'll see:
```
📊 Retrieval Statistics:
   Total retrievals performed: 40
```
This shows how many times the system queried Wikipedia: one retrieval per answer choice in each multiple-choice question, so a 10-question run over 4-choice MMLU items produces 40 retrievals.
The results summary shows:
```
📈 Results Summary:
   mmlu_global_facts: 30.00%
```
Compare this with your baseline results to measure RAG improvement!
For each question, the system:

- Extracts the question from the MMLU prompt
- Searches Wikipedia for the 3 most relevant chunks
- Augments the prompt with the retrieved context (prompt template and code sketch below)
- Has the model answer with the enriched context

The augmented prompt looks like this:

```
Reference information from Wikipedia:

[Article 1] relevant text...
[Article 2] relevant text...
[Article 3] relevant text...

Based on the above information and your knowledge, answer the following:

[Original MMLU question]
```
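The retrieve-and-augment step can be reproduced in a few lines against the ChromaDB knowledge base. The snippet below is a minimal sketch rather than the actual `rag_eval.py` implementation; the database path `./chroma_db` and collection name `wikipedia` are assumptions, and it uses the standard `chromadb` client API.

```python
import chromadb

# Minimal sketch of retrieve-and-augment (not the actual rag_eval.py code).
# Assumes a persisted ChromaDB collection named "wikipedia" in ./chroma_db.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("wikipedia")

def augment_prompt(question: str, n_retrieval: int = 3) -> str:
    # Fetch the n_retrieval most relevant chunks for the question text
    results = collection.query(query_texts=[question], n_results=n_retrieval)
    chunks = results["documents"][0]

    context = "\n".join(f"[Article {i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Reference information from Wikipedia:\n\n"
        f"{context}\n\n"
        "Based on the above information and your knowledge, answer the following:\n\n"
        f"{question}"
    )

print(augment_prompt("Which continent holds the largest share of the world's population?"))
```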
Make sure you've activated the virtual environment:

```bash
source venv/bin/activate
```

Try reducing the number of retrieved chunks:

```bash
python rag_eval.py --n_retrieval 1 --limit 10
```

The RAG evaluation is slower than baseline because:
- It queries ChromaDB for each question
- Prompts are longer (includes retrieved context)
For testing, use --limit to evaluate fewer examples.
- Start small: Always test with `--limit 10` first
- Compare apples-to-apples: Run baseline and RAG on the same tasks
- Experiment with retrieval: Try different `--n_retrieval` values (1, 3, 5, 7)
- Check relevance: Use `query_embeddings.py` to verify your knowledge base has relevant content (see the sketch after this list)
- Monitor resources: RAG uses more memory due to longer prompts
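Before a long run, it helps to check whether the knowledge base actually covers the task's topics. The snippet below is a hedged sketch of such a relevance check (the real `query_embeddings.py` may work differently); it assumes the same `./chroma_db` path and `wikipedia` collection as above and prints the closest chunks with their distances.

```python
import chromadb

# Sketch of a relevance check; query_embeddings.py itself may differ.
# Assumes a ChromaDB collection named "wikipedia" stored in ./chroma_db.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("wikipedia")

results = collection.query(
    query_texts=["major world religions and their core beliefs"],
    n_results=5,
    include=["documents", "distances"],
)

# Lower distance = closer match; skim the text to confirm it is on-topic
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.3f}  {doc[:120]}...")
```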
Results are automatically saved to:

```
./results/rag-{model-name}/{timestamp}.json
```
These files are compatible with your existing dashboard and can be compared directly with baseline results.
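For a quick comparison outside the dashboard, you can diff the two result files directly. This is a minimal sketch that assumes both files follow the lm-evaluation-harness results layout (a top-level `results` dict keyed by task, with an accuracy metric such as `acc,none`); the file paths are placeholders for your actual timestamped outputs.

```python
import json

# Placeholder paths: substitute your actual timestamped result files
BASELINE = "./results/meta-llama__Llama-3.2-3B/<timestamp>.json"
RAG = "./results/rag-meta-llama__Llama-3.2-3B/<timestamp>.json"

def load_acc(path: str, task: str, metric: str = "acc,none") -> float:
    # Read one accuracy value from an lm-eval-style results file
    with open(path) as f:
        return json.load(f)["results"][task][metric]

task = "mmlu_global_facts"
baseline_acc = load_acc(BASELINE, task)
rag_acc = load_acc(RAG, task)
print(f"{task}: baseline {baseline_acc:.2%} -> RAG {rag_acc:.2%} "
      f"(delta {rag_acc - baseline_acc:+.2%})")
```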