```bash
# Quick RAG test on 10 examples
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps \
    --limit 10
```

```bash
# Full RAG evaluation
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps
```

```bash
# Baseline evaluation (no RAG) with lm-eval-harness
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --output_path ./results/ \
    --device mps
```

```bash
# RAG evaluation for comparison against the baseline
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps
```

```bash
# Launch the dashboard
cd dashboard
python server.py
```

The dashboard will show both models:
- `meta-llama__Llama-3.2-3B` - Baseline model
- `rag-meta-llama__Llama-3.2-3B` - RAG-enhanced model
```bash
# Default is 3, try 5 for more context
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps \
    --n_retrieval 5
```

```bash
# Try other MMLU subjects where Wikipedia might help
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_world_religions \
    --device mps
```

```bash
# Use a custom ChromaDB location
python rag_eval.py \
    --model meta-llama/Llama-3.2-3B \
    --tasks mmlu_global_facts \
    --device mps \
    --chroma_path /path/to/chroma_db
```

After each run, you'll see:
```
📊 Retrieval Statistics:
   Total retrievals performed: 40
```
This shows how many times the system queried Wikipedia: one retrieval per answer choice in each multiple-choice question, so a 10-question run over 4-choice MMLU items produces 40 retrievals.
The results summary shows:
```
📈 Results Summary:
   mmlu_global_facts: 30.00%
```
Compare this with your baseline results to measure RAG improvement!
For each question, the system:

- Extracts the question from the MMLU prompt
- Searches Wikipedia for the 3 most relevant chunks
- Augments the prompt with the retrieved context (prompt template and code sketch below)
- Has the model answer with the enriched context

The augmented prompt looks like this:

```
Reference information from Wikipedia:

[Article 1] relevant text...
[Article 2] relevant text...
[Article 3] relevant text...

Based on the above information and your knowledge, answer the following:

[Original MMLU question]
```
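The retrieve-and-augment step can be reproduced in a few lines against the ChromaDB knowledge base. The snippet below is a minimal sketch rather than the actual `rag_eval.py` implementation; the database path `./chroma_db` and collection name `wikipedia` are assumptions, and it uses the standard `chromadb` client API.

```python
import chromadb

# Minimal sketch of retrieve-and-augment (not the actual rag_eval.py code).
# Assumes a persisted ChromaDB collection named "wikipedia" in ./chroma_db.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("wikipedia")

def augment_prompt(question: str, n_retrieval: int = 3) -> str:
    # Fetch the n_retrieval most relevant chunks for the question text
    results = collection.query(query_texts=[question], n_results=n_retrieval)
    chunks = results["documents"][0]

    context = "\n".join(f"[Article {i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Reference information from Wikipedia:\n\n"
        f"{context}\n\n"
        "Based on the above information and your knowledge, answer the following:\n\n"
        f"{question}"
    )

print(augment_prompt("Which continent holds the largest share of the world's population?"))
```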
Make sure you've activated the virtual environment:

```bash
source venv/bin/activate
```

Try reducing the number of retrieved chunks:

```bash
python rag_eval.py --n_retrieval 1 --limit 10
```

The RAG evaluation is slower than baseline because:
- It queries ChromaDB for each question
- Prompts are longer (includes retrieved context)
For testing, use --limit to evaluate fewer examples.
- Start small: Always test with `--limit 10` first
- Compare apples-to-apples: Run baseline and RAG on the same tasks
- Experiment with retrieval: Try different `--n_retrieval` values (1, 3, 5, 7)
- Check relevance: Use `query_embeddings.py` to verify your knowledge base has relevant content (see the sketch after this list)
- Monitor resources: RAG uses more memory due to longer prompts
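Before a long run, it helps to check whether the knowledge base actually covers the task's topics. The snippet below is a hedged sketch of such a relevance check (the real `query_embeddings.py` may work differently); it assumes the same `./chroma_db` path and `wikipedia` collection as above and prints the closest chunks with their distances.

```python
import chromadb

# Sketch of a relevance check; query_embeddings.py itself may differ.
# Assumes a ChromaDB collection named "wikipedia" stored in ./chroma_db.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("wikipedia")

results = collection.query(
    query_texts=["major world religions and their core beliefs"],
    n_results=5,
    include=["documents", "distances"],
)

# Lower distance = closer match; skim the text to confirm it is on-topic
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.3f}  {doc[:120]}...")
```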
Results are automatically saved to:

```
./results/rag-{model-name}/{timestamp}.json
```
These files are compatible with your existing dashboard and can be compared directly with baseline results.
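For a quick comparison outside the dashboard, you can diff the two result files directly. This is a minimal sketch that assumes both files follow the lm-evaluation-harness results layout (a top-level `results` dict keyed by task, with an accuracy metric such as `acc,none`); the file paths are placeholders for your actual timestamped outputs.

```python
import json

# Placeholder paths: substitute your actual timestamped result files
BASELINE = "./results/meta-llama__Llama-3.2-3B/<timestamp>.json"
RAG = "./results/rag-meta-llama__Llama-3.2-3B/<timestamp>.json"

def load_acc(path: str, task: str, metric: str = "acc,none") -> float:
    # Read one accuracy value from an lm-eval-style results file
    with open(path) as f:
        return json.load(f)["results"][task][metric]

task = "mmlu_global_facts"
baseline_acc = load_acc(BASELINE, task)
rag_acc = load_acc(RAG, task)
print(f"{task}: baseline {baseline_acc:.2%} -> RAG {rag_acc:.2%} "
      f"(delta {rag_acc - baseline_acc:+.2%})")
```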