RAG · Intermediate Tutorial
Build a Production RAG System
60 minutes
10 steps
Python 3.10+
Last updated: June 2026
Most RAG tutorials use LangChain and hide the complexity. We'll build it from scratch
so you understand every piece: document loading, chunking, embeddings, vector search, and prompt augmentation.
By the end, you'll have a system that can answer questions from your own documents.
What You'll Build
A command-line RAG system that:
- Loads PDFs, text files, and markdown documents
- Chunks them into overlapping pieces of a few hundred characters (large enough to keep context, small enough to stay on topic)
- Creates embeddings using OpenAI's embedding model
- Stores them in ChromaDB (local vector database)
- Retrieves relevant chunks when you ask a question
- Feeds those chunks to GPT-4o-mini for an accurate, cited answer
Prerequisites
- Python 3.10+ installed
- OpenAI API key
- Some documents to test with (PDFs, .txt, or .md files)
- ~$1-2 in API credits for embeddings
Step 1: How RAG Works
Before coding, understand the pipeline:
- Ingest: Load documents from files
- Chunk: Split into paragraphs/sentences
- Embed: Convert chunks to vectors (embeddings)
- Store: Save vectors in a vector database
- Query: Convert question to vector, find nearest neighbors
- Generate: Feed relevant chunks + question to LLM
The magic: Embeddings capture semantic meaning. "Car" and "automobile" have similar vectors even though the words are different. This lets us find relevant documents even without keyword matching.
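To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare embeddings. The three-dimensional vectors are made-up illustrative values, not real model output; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as "no match"
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

car_vs_automobile = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
car_vs_banana = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])
print(f"car vs automobile: {car_vs_automobile:.3f}")
print(f"car vs banana:     {car_vs_banana:.3f}")
```

The first score comes out close to 1 and the second close to 0, which is the property nearest-neighbor search relies on.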
Step 2: Project Setup
mkdir rag-tutorial && cd rag-tutorial
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install openai chromadb python-dotenv pypdf
touch rag.py .env
mkdir documents
Add your API key to .env:
OPENAI_API_KEY=sk-your-key-here
Add sample documents to the documents/ folder. If you don't have any, create a sample:
echo 'AI video generation has advanced rapidly. Models like Sora, Runway Gen-3, and Kling can create realistic video from text prompts. Key challenges include temporal consistency, prompt adherence, and compute cost. The market is expected to reach $1.8B by 2027.' > documents/ai-video.txt
echo "Retrieval-Augmented Generation (RAG) combines information retrieval with text generation. Instead of relying on parametric knowledge, RAG systems fetch relevant documents and include them in the prompt. This reduces hallucinations and enables knowledge cutoff extension." > documents/rag-overview.txt
Step 3: Load Documents
Open rag.py and add document loading:
import os
from pathlib import Path

from pypdf import PdfReader  # installed in Step 2


def load_documents(directory="documents"):
    """Load all .txt, .md, and .pdf files from a directory."""
    docs = []
    # Sort so documents (and the chunk IDs built from them later) come out in a stable order
    for filepath in sorted(Path(directory).glob("*")):
        suffix = filepath.suffix.lower()  # also match .TXT, .PDF, ...
        if suffix in (".txt", ".md"):
            with open(filepath, "r", encoding="utf-8") as f:
                docs.append({
                    "content": f.read(),
                    "source": str(filepath),
                    "type": suffix,
                })
        elif suffix == ".pdf":
            reader = PdfReader(str(filepath))
            # `or ""` guards against pages with no extractable text
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
            docs.append({"content": text, "source": str(filepath), "type": suffix})
    return docs


# Quick test (remove this block once main() is added in Step 8)
if __name__ == "__main__":
    docs = load_documents()
    print(f"Loaded {len(docs)} documents")
    for d in docs:
        print(f"  - {d['source']}: {len(d['content'])} chars")
For production: Add error handling, support for Word docs, web pages (via requests + BeautifulSoup), and database connectors.
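One way to start on the error-handling part is a small wrapper that skips unreadable files instead of crashing the whole run. `read_text_safely` and `load_text_documents` are hypothetical helper names, and replacing invalid UTF-8 bytes rather than failing is an assumption about what you want; you may prefer to fail loudly.

```python
from pathlib import Path


def read_text_safely(filepath):
    """Return the file's text, or None if it cannot be read.

    Hypothetical helper: invalid UTF-8 bytes are replaced instead of raising.
    """
    try:
        return Path(filepath).read_text(encoding="utf-8", errors="replace")
    except OSError as exc:  # missing file, permission denied, path is a directory, ...
        print(f"Skipping {filepath}: {exc}")
        return None


def load_text_documents(directory="documents"):
    """Load .txt and .md files, skipping empty or unreadable ones."""
    docs = []
    for filepath in sorted(Path(directory).glob("*")):
        suffix = filepath.suffix.lower()
        if suffix not in {".txt", ".md"}:
            continue
        text = read_text_safely(filepath)
        if text is None or not text.strip():
            continue  # nothing useful to index
        docs.append({"content": text, "source": str(filepath), "type": suffix})
    return docs
```

Skipping whitespace-only files also avoids sending empty strings to the embedding step later.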
Step 4: Chunk Documents
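Chunking is one of the most consequential steps in RAG. Chunks that are too big carry irrelevant text that dilutes the answer; chunks that are too small lose context. Note that chunk_size and overlap below are measured in characters, not tokens.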
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks (sizes are in characters)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a paragraph, sentence, or word boundary
        if end < len(text):
            for breaker in ["\n\n", ". ", " "]:
                idx = chunk.rfind(breaker)
                if idx > chunk_size * 0.7:  # only break if we're past 70% of the chunk
                    chunk = chunk[:idx + len(breaker)]
                    end = start + len(chunk)
                    break
        if chunk.strip():  # skip whitespace-only chunks
            chunks.append(chunk.strip())
        if end >= len(text):
            break  # reached the end; don't emit a tail made only of overlap
        # Step back by `overlap`, but always move forward by at least one character
        start = max(end - overlap, start + 1)
    return chunks


def chunk_documents(docs):
    """Chunk all documents into pieces with metadata."""
    chunks = []
    for doc in docs:
        text_chunks = chunk_text(doc["content"])
        for i, chunk in enumerate(text_chunks):
            chunks.append({
                "text": chunk,
                "source": doc["source"],
                "chunk_index": i,
                "total_chunks": len(text_chunks),
            })
    return chunks
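Chunk size rules of thumb:
- 200-400 tokens for precise Q&A
- 500-1000 tokens for summaries and broad questions
- 20-50% overlap, so text near a chunk boundary appears in both neighboring chunks
Keep in mind that chunk_text counts characters, not tokens. At roughly 4 characters per token for English text, the defaults above (500 characters, 50 overlap) come to about 125 tokens with 10% overlap, so raise them if you want to match these ranges.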
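Because chunk_text measures size in characters while these rules of thumb are in tokens, a rough conversion helps. The sketch below assumes about 4 characters per token, a common approximation for English text; the helper names are hypothetical, and a real tokenizer would give exact counts.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def tokens_to_chars(tokens):
    """Approximate character budget for a target token count."""
    return tokens * CHARS_PER_TOKEN


def estimate_chunk_count(text_length, chunk_size, overlap):
    """Approximate number of chunks a sliding window produces.

    Each new chunk advances by (chunk_size - overlap) characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


chunk_size = tokens_to_chars(300)   # 300-token target
overlap = int(chunk_size * 0.2)     # 20% overlap
print(chunk_size, overlap, estimate_chunk_count(10_000, chunk_size, overlap))
```

The count is a lower bound for chunk_text, since breaking at sentence boundaries makes some chunks shorter than chunk_size.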
Step 5: Create Embeddings
Embeddings convert text into vectors. We'll use OpenAI's text-embedding-3-small, which is fast, cheap, and high quality.
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def create_embeddings(texts, batch_size=100):
    """Create embeddings for a list of texts."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded batch {i // batch_size + 1}/{(len(texts) - 1) // batch_size + 1}")
    return all_embeddings
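Cost: text-embedding-3-small costs ~$0.02 per 1M tokens. Even 1,000 pages of text is typically under a million tokens, so embedding costs pennies; the $1-2 budget from the prerequisites leaves plenty of room for re-indexing and for the GPT-4o-mini calls.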
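The arithmetic behind a cost estimate, as a small sketch. The price is the ~$0.02 per 1M tokens quoted above; the 4-characters-per-token ratio and the 3,000 characters per page are assumptions, so treat the result as an order-of-magnitude figure and check current pricing.

```python
def estimate_embedding_cost(total_chars, price_per_million_tokens=0.02, chars_per_token=4):
    """Rough embedding cost in USD for a corpus of total_chars characters."""
    tokens = total_chars / chars_per_token
    return tokens / 1_000_000 * price_per_million_tokens


CHARS_PER_PAGE = 3_000  # assumption: a dense page of English text

for pages in (100, 1_000, 100_000):
    cost = estimate_embedding_cost(pages * CHARS_PER_PAGE)
    print(f"{pages:>7} pages: about ${cost:.4f}")
```

Overlap between chunks means some text is embedded twice, so the real figure runs slightly higher than this estimate.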
Step 6: Store in ChromaDB
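ChromaDB is a local vector database: no server needed, just Python. The collection below is configured with ChromaDB's OpenAIEmbeddingFunction, so ChromaDB calls the embedding model itself when you add or query text; create_embeddings from Step 5 is only needed if you want to compute vectors yourself.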
import os

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


def setup_chroma():
    """Initialize ChromaDB with OpenAI embeddings."""
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    embedding_fn = OpenAIEmbeddingFunction(
        api_key=os.getenv("OPENAI_API_KEY"),
        model_name="text-embedding-3-small",
    )
    collection = chroma_client.get_or_create_collection(
        name="documents",
        embedding_function=embedding_fn,
        metadata={"hnsw:space": "cosine"},  # compare vectors by cosine distance
    )
    return collection


def index_chunks(collection, chunks):
    """Store chunks in ChromaDB (the collection computes the embeddings)."""
    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    metadatas = [{"source": c["source"], "index": c["chunk_index"]} for c in chunks]
    # Add in batches
    batch_size = 100
    for i in range(0, len(texts), batch_size):
        collection.add(
            ids=ids[i:i + batch_size],
            documents=texts[i:i + batch_size],
            metadatas=metadatas[i:i + batch_size],
        )
        print(f"Indexed batch {i // batch_size + 1}")
    print(f"Indexed {len(chunks)} chunks")
Persistence: ChromaDB saves to ./chroma_db. Next time you run the script, it loads existing data. Delete the folder to re-index from scratch.
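Deleting the folder is the blunt option. A gentler sketch, assuming you want re-runs to be idempotent: derive each chunk's ID from its source and content, so an unchanged chunk always maps to the same ID. `stable_chunk_id` and `find_new_chunks` are hypothetical helpers, not part of ChromaDB.

```python
import hashlib


def stable_chunk_id(source, text):
    """Deterministic ID: the same source and text always give the same ID."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"chunk_{digest[:16]}"


def find_new_chunks(chunks, existing_ids):
    """Return (id, chunk) pairs whose ID is not already in existing_ids."""
    seen = set(existing_ids)
    new_items = []
    for chunk in chunks:
        chunk_id = stable_chunk_id(chunk["source"], chunk["text"])
        if chunk_id not in seen:
            seen.add(chunk_id)  # also drops exact duplicates within this batch
            new_items.append((chunk_id, chunk))
    return new_items


chunks = [
    {"source": "documents/a.txt", "text": "RAG combines retrieval with generation."},
    {"source": "documents/a.txt", "text": "It reduces hallucinations."},
]
already_indexed = {stable_chunk_id("documents/a.txt", "It reduces hallucinations.")}
to_index = find_new_chunks(chunks, already_indexed)
print([chunk["text"] for _, chunk in to_index])
```

With ChromaDB, one way to use this would be to read the existing IDs from `collection.get()["ids"]` and add only the new pairs, or to pass the stable IDs to `collection.upsert` instead of `collection.add`.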
Step 7: Retrieve & Query
Now the fun part: asking questions.
def query_rag(collection, question, n_results=3):
    """Retrieve relevant chunks and generate an answer."""
    # 1. Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
    )
    contexts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        contexts.append(f"[Source: {meta['source']}]\n{doc}")
    context_text = "\n\n---\n\n".join(contexts)

    # 2. Generate answer with context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Answer based ONLY on the provided context. If the answer isn't in the context, say 'I don't have enough information.' Cite sources."},
            {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"},
        ],
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [m["source"] for m in results["metadatas"][0]],
        "contexts": contexts,
    }
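A query result also carries a distance for each hit (lower means closer under the cosine setting from Step 6). Here is a minimal sketch of dropping weak matches before they reach the prompt. `filter_by_distance` and the 0.6 cutoff are hypothetical, and a sensible threshold depends on your data. The `fake_results` dict is hand-written and only mimics the nested-list shape of a single-question result.

```python
def filter_by_distance(results, max_distance=0.6):
    """Keep (document, metadata, distance) triples at or below max_distance.

    Expects the nested-list shape of a single-question query result.
    """
    kept = []
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        if dist <= max_distance:
            kept.append((doc, meta, dist))
    return kept


# Hand-written stand-in for a query result (not real output)
fake_results = {
    "documents": [["RAG fetches relevant documents.", "Bananas are yellow."]],
    "metadatas": [[{"source": "documents/rag-overview.txt"}, {"source": "documents/fruit.txt"}]],
    "distances": [[0.21, 0.93]],
}

for doc, meta, dist in filter_by_distance(fake_results):
    print(f"{dist:.2f}  {meta['source']}  {doc}")
```

If nothing survives the filter, you can return the "I don't have enough information" reply directly and skip the model call.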
Step 8: Build the RAG Pipeline
Wire everything together in rag.py:
def main():
    print("RAG System\n")

    # 1. Load
    print("1. Loading documents...")
    docs = load_documents()

    # 2. Chunk
    print("2. Chunking...")
    chunks = chunk_documents(docs)
    print(f"   Created {len(chunks)} chunks")

    # 3. Setup DB
    print("3. Setting up ChromaDB...")
    collection = setup_chroma()

    # 4. Index (skip if already indexed)
    if collection.count() == 0:
        print("4. Indexing chunks...")
        index_chunks(collection, chunks)
    else:
        print(f"4. Using existing index ({collection.count()} chunks)")

    # 5. Query loop
    print("\nReady! Ask questions (or type 'quit'):\n")
    while True:
        question = input("Q: ").strip()
        if question.lower() in ["quit", "exit"]:
            break
        if not question:
            continue  # ignore empty input instead of sending it to the API
        result = query_rag(collection, question)
        print(f"\nA: {result['answer']}\n")
        print(f"Sources: {', '.join(result['sources'])}\n")


if __name__ == "__main__":
    main()
Run it:
python rag.py
Try these questions:
Q: What is RAG?
Q: What are the challenges in AI video generation?
Q: What is the market size for AI video?
Step 9: Evaluate Quality
RAG systems fail silently: a bad retrieval still produces a fluent, confident-sounding answer. Add a basic evaluation check:
def evaluate_answer(question, answer, contexts):
    """Simple relevance check: does the answer share enough words with any context?"""
    answer_words = set(answer.lower().split())
    for ctx in contexts:
        ctx_words = set(ctx.lower().split())
        if not ctx_words:
            continue  # avoid dividing by zero on an empty context
        # Fraction of the context's distinct words that also appear in the answer
        overlap = len(ctx_words & answer_words) / len(ctx_words)
        if overlap > 0.3:
            return "good"
    return "check needed"
For production: Use an LLM-as-judge pattern. Ask GPT-4 to rate answer relevance 1-5. Log failures and improve chunking/retrieval.
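A minimal sketch of the LLM-as-judge idea. The prompt wording, `build_judge_prompt`, `parse_score`, and `judge_answer` are all hypothetical; the model call is passed in as a plain function so the parsing logic can be tried without an API key.

```python
import re


def build_judge_prompt(question, answer, contexts):
    """Assemble a grading prompt asking for a single 1-5 relevance score."""
    joined = "\n\n---\n\n".join(contexts)
    return (
        "Rate how well the ANSWER addresses the QUESTION using only the CONTEXT.\n"
        "Reply with a single integer from 1 (irrelevant) to 5 (fully relevant and grounded).\n\n"
        f"CONTEXT:\n{joined}\n\nQUESTION: {question}\n\nANSWER: {answer}\n\nSCORE:"
    )


def parse_score(reply):
    """Extract the first standalone digit 1-5 from the judge's reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_answer(question, answer, contexts, ask_llm):
    """ask_llm: any callable that takes a prompt string and returns the model's text."""
    return parse_score(ask_llm(build_judge_prompt(question, answer, contexts)))


# Stand-in for a real model call, used only to exercise the code path
def fake_llm(prompt):
    return "Score: 4"


score = judge_answer(
    "What is RAG?",
    "RAG combines retrieval with generation.",
    ["RAG combines information retrieval with text generation."],
    fake_llm,
)
print(score)
```

In the tutorial's setup, `ask_llm` could be a thin wrapper around `client.chat.completions.create` that returns the message content. Treat a `None` score as a failure to log rather than as a low rating.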
Step 10: Production Tips
- Hybrid search: Combine vector similarity + keyword matching (BM25) for better results
- Re-ranking: Retrieve 20 chunks, then use a cross-encoder to pick the top 5
- Query rewriting: Expand "it" and "that" in follow-up questions using conversation history
- Metadata filtering: Filter by date, author, or document type before vector search
- Incremental updates: Only re-index changed documents, not everything
Common failure mode: If your chunks are too long, the LLM ignores the middle. If too short, context is lost. Test with your actual documents and adjust chunk_size.
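For the hybrid search tip in the list above, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below assumes you already have two ranked lists of chunk IDs (the IDs shown are made up). The value k=60 is the constant conventionally used with RRF, not something tuned for this tutorial.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


vector_hits = ["chunk_3", "chunk_1", "chunk_7"]    # hypothetical vector-search order
keyword_hits = ["chunk_1", "chunk_9", "chunk_3"]   # hypothetical BM25 order

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, and RRF only needs rank positions, so the two retrievers' raw scores never have to be put on a common scale.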
Next Steps
- Add a web interface: Wrap in Streamlit or Gradio for a chat UI
- Connect to APIs: Ingest from Notion, Confluence, or Google Drive
- Multi-modal RAG: Add image understanding with CLIP embeddings
- Agentic RAG: Combine with our AI Agent tutorial for self-correcting retrieval