A simple Retrieval-Augmented Generation (RAG) system built from scratch using OpenAI's embedding and chat completion APIs.
- Vector Database: In-memory storage for text chunks and their embeddings
- Semantic Search: Uses cosine similarity to find relevant information
- OpenAI Integration: Uses OpenAI's text-embedding-3-small for embeddings and gpt-3.5-turbo for generation
- Interactive Chat: Continuous prompting interface to ask multiple questions
- Flexible Data Source: Works with any text file - just change the
DATA_FILEvariable
- Install dependencies:
pip install -r requirements.txt- Set your OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"- Run the system:
python main.py-
Indexing Phase:
- Loads text data from the specified file (default:
cat-facts.txt) - Creates embeddings for each line using OpenAI's embedding model
- Stores text chunks and embeddings in an in-memory vector database
- Loads text data from the specified file (default:
-
Retrieval Phase:
- Converts user query to embedding
- Finds most similar text chunks using cosine similarity
- Returns top 5 most relevant results with similarity scores
-
Generation Phase:
- Uses retrieved chunks as context
- Generates response using OpenAI's chat completion API
- Continues prompting for more questions until user exits
To use your own data, simply:
- Create or replace the text file (each line will be treated as a separate chunk)
- Update the
DATA_FILEvariable inmain.py:DATA_FILE = 'your-data-file.txt'
Building vector database...
Loaded 150 entries from cat-facts.txt
Added 150 chunks to database
Ask me anything about the loaded data! (Type 'quit' or 'exit' to stop)
====================================================================================================
Your question: How long do cats sleep?
Top 5 relevant results:
----------------------------------------------------------------------------------------------------
1. Cats spend 2/3 of every day sleeping... (similarity: 0.8432)
2. Cats sleep 16 to 18 hours per day... (similarity: 0.8201)
...
----------------------------------------------------------------------------------------------------
Answer:
Based on the context provided, cats spend about 2/3 of every day sleeping, which amounts to 16-18 hours per day...
====================================================================================================
Your question: quit
Goodbye!
main.py: Complete RAG implementationcat-facts.txt: Sample dataset (150 cat facts) - replace with your own datarequirements.txt: Python dependenciesREADME.md: This file