📚 RAG · Intermediate Tutorial

Build a Production RAG System

⏱️ 60 minutes 📋 10 steps 💻 Python 3.10+ 🔄 Last updated: June 2026

Most RAG tutorials use LangChain and hide the complexity. We'll build it from scratch so you understand every piece: document loading, chunking, embeddings, vector search, and prompt augmentation. By the end, you'll have a system that can answer questions from your own documents.

What You'll Build
Prerequisites
Step 1: How RAG Works
Step 2: Project Setup
Step 3: Load Documents
Step 4: Chunk Documents
Step 5: Create Embeddings
Step 6: Store in ChromaDB
Step 7: Retrieve & Query
Step 8: Build the RAG Pipeline
Step 9: Evaluate Quality
Step 10: Production Tips
Next Steps

🎯 What You'll Build

A command-line RAG system that:

Loads PDFs, text files, and markdown documents
Chunks them into optimal-sized pieces (not too big, not too small)
Creates embeddings using OpenAI's embedding model
Stores them in ChromaDB (local vector database)
Retrieves relevant chunks when you ask a question
Feeds those chunks to GPT-4o-mini for an accurate, cited answer

📋 Prerequisites

Python 3.10+ installed
OpenAI API key
Some documents to test with (PDFs, .txt, or .md files)
~$1–2 in API credits (ample for the embedding and GPT-4o-mini calls in this tutorial)

Step 1: How RAG Works

Before coding, understand the pipeline:

Ingest → Load documents from files
Chunk → Split into paragraphs/sentences
Embed → Convert chunks to vectors (embeddings)
Store → Save vectors in a vector database
Query → Convert question to vector, find nearest neighbors
Generate → Feed relevant chunks + question to LLM

💡 The magic: Embeddings capture semantic meaning. "Car" and "automobile" have similar vectors even though the words are different. This lets us find relevant documents even without keyword matching.
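To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure this tutorial later configures in ChromaDB. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration; real embeddings have hundreds or thousands of dimensions and come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (hypothetical values, not model output).
toy_vectors = {
    "car":        [0.90, 0.10, 0.05],
    "automobile": [0.85, 0.15, 0.10],
    "banana":     [0.05, 0.20, 0.95],
}

for word in ("automobile", "banana"):
    score = cosine_similarity(toy_vectors["car"], toy_vectors[word])
    print(f"car vs {word}: {score:.3f}")
```

With these toy values, "car" scores much closer to "automobile" than to "banana", which is the property nearest-neighbor search relies on.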

Step 2: Project Setup

    mkdir rag-tutorial && cd rag-tutorial
    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    pip install openai chromadb python-dotenv pypdf
    touch rag.py .env
    mkdir documents

Add your API key to .env:

OPENAI_API_KEY=sk-your-key-here

Add sample documents to the documents/ folder. If you don't have any, create a sample:

    # Single quotes stop the shell from expanding $1 inside "$1.8B"
    echo 'AI video generation has advanced rapidly. Models like Sora, Runway Gen-3, and Kling can create realistic video from text prompts. Key challenges include temporal consistency, prompt adherence, and compute cost. The market is expected to reach $1.8B by 2027.' > documents/ai-video.txt
    echo 'Retrieval-Augmented Generation (RAG) combines information retrieval with text generation. Instead of relying on parametric knowledge, RAG systems fetch relevant documents and include them in the prompt. This reduces hallucinations and enables knowledge cutoff extension.' > documents/rag-overview.txt

Step 3: Load Documents

Create rag.py and add document loading:

    import os
    from pathlib import Path

    def load_documents(directory="documents"):
        """Load all .txt, .md, and .pdf files from a directory"""
        docs = []
        # Sorted so documents (and later chunk IDs) come out in a stable order
        for filepath in sorted(Path(directory).glob("*")):
            suffix = filepath.suffix.lower()  # Also match .TXT, .PDF, etc.
            if suffix in [".txt", ".md"]:
                with open(filepath, "r", encoding="utf-8") as f:
                    docs.append({
                        "content": f.read(),
                        "source": str(filepath),
                        "type": suffix
                    })
            elif suffix == ".pdf":
                # Requires: pip install pypdf
                from pypdf import PdfReader
                reader = PdfReader(str(filepath))
                # extract_text() can return None for image-only pages
                text = "\n".join(page.extract_text() or "" for page in reader.pages)
                docs.append({"content": text, "source": str(filepath), "type": ".pdf"})
        return docs

    # Quick test (remove this block once main() is added in Step 8)
    if __name__ == "__main__":
        docs = load_documents()
        print(f"Loaded {len(docs)} documents")
        for d in docs:
            print(f"  - {d['source']}: {len(d['content'])} chars")

💡 For production: Add error handling, support for Word docs, web pages (via requests + BeautifulSoup), and database connectors.
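As one possible shape for that error handling, the sketch below reads text files and collects failures instead of raising on the first bad file. The function name `load_text_files_safely` and the `(docs, errors)` return convention are assumptions made for this example, not part of the tutorial's `rag.py`; PDF handling is left out to keep it short.

```python
from pathlib import Path

def load_text_files_safely(directory="documents", suffixes=(".txt", ".md")):
    """Return (docs, errors) so one unreadable file does not abort the whole load."""
    docs, errors = [], []
    root = Path(directory)
    if not root.is_dir():
        return docs, [(str(root), "directory not found")]

    for filepath in sorted(root.iterdir()):
        suffix = filepath.suffix.lower()
        if not filepath.is_file() or suffix not in suffixes:
            continue  # Skip subdirectories and unsupported file types
        try:
            content = filepath.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            errors.append((str(filepath), str(exc)))
            continue
        if content.strip():
            docs.append({"content": content, "source": str(filepath), "type": suffix})
        else:
            errors.append((str(filepath), "file is empty"))
    return docs, errors
```

Returning the errors rather than printing them lets the caller decide whether a failed file should stop indexing or just be logged.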

Step 4: Chunk Documents

Chunking has an outsized effect on RAG quality. Chunks that are too big bury the relevant passage in unrelated text; chunks that are too small lose the context needed to interpret them.

    def chunk_text(text, chunk_size=500, overlap=50):
        """Split text into overlapping chunks (sizes are in characters)"""
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunk = text[start:end]

            # Try to break at a paragraph, sentence, or word boundary
            if end < len(text):
                for breaker in ["\n\n", ". ", " "]:
                    idx = chunk.rfind(breaker)
                    if idx > chunk_size * 0.7:  # Only break if we're past 70% of chunk
                        chunk = chunk[:idx + len(breaker)]
                        end = start + len(chunk)
                        break

            if chunk.strip():  # Skip whitespace-only chunks
                chunks.append(chunk.strip())
            if end >= len(text):
                break  # Last chunk emitted; stepping back would only repeat its tail
            start = max(end - overlap, start + 1)  # Always move forward
        return chunks

    # Chunk all documents
    def chunk_documents(docs):
        """Chunk all documents into pieces with metadata"""
        chunks = []
        for doc in docs:
            text_chunks = chunk_text(doc["content"])
            for i, chunk in enumerate(text_chunks):
                chunks.append({
                    "text": chunk,
                    "source": doc["source"],
                    "chunk_index": i,
                    "total_chunks": len(text_chunks)
                })
        return chunks

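💡 Chunk size rules of thumb:
• 200–400 tokens for precise Q&A
• 500–1000 tokens for summaries and broad questions
• 10–20% overlap (the code above uses 10%), so a sentence cut at one chunk boundary still appears whole in the neighboring chunk
Note that chunk_text measures chunk_size and overlap in characters, not tokens. At roughly 4 characters per token for English text, the default of 500 characters is only about 125 tokens.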
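These rules of thumb are stated in tokens, while the `chunk_size` argument of `chunk_text` counts characters. The sketch below converts between the two using the common heuristic of about 4 characters per token for English text. That ratio is an assumption, not a tokenizer; both helper names are hypothetical, and a real tokenizer will give different counts for code or non-English text.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is a heuristic, not a tokenizer."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))

def chars_for_token_budget(target_tokens, chars_per_token=4):
    """Convert a token target into a character count usable as chunk_size."""
    return target_tokens * chars_per_token

for target in (200, 400, 1000):
    print(f"{target} tokens is roughly {chars_for_token_budget(target)} characters")
```

Under this assumption, a 300-token target for precise Q&A corresponds to a `chunk_size` of about 1,200 characters.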

Step 5: Create Embeddings

Embeddings convert text into vectors. We'll use OpenAI's text-embedding-3-small — fast, cheap, and high quality.

    import os
    from openai import OpenAI
    from dotenv import load_dotenv

    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def create_embeddings(texts, batch_size=100):
        """Create embeddings for a list of texts"""
        all_embeddings = []

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=batch
            )
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)
            print(f"Embedded batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")

        return all_embeddings

💡 Cost: text-embedding-3-small costs ~$0.02 per 1M tokens. Assuming about 500 tokens per page, 100 pages costs a fraction of a cent and 1,000 pages about one cent. A $1–2 budget covers on the order of 100,000 pages.
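To estimate embedding spend for your own corpus, multiply the token count by the per-token price. The sketch below does that arithmetic; the 500 tokens-per-page figure is an assumed average for illustration, so measure your own documents before relying on it, and check the current price list since rates change.

```python
def embedding_cost_usd(num_pages, tokens_per_page=500, usd_per_million_tokens=0.02):
    """Estimate embedding cost; tokens_per_page is an assumed average, not a measurement."""
    total_tokens = num_pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens

for pages in (100, 1_000, 100_000):
    print(f"{pages:>7} pages: about ${embedding_cost_usd(pages):.4f}")
```

This covers one-time indexing only. Each question also embeds the query text and pays for the chat completion, which is billed separately.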

Step 6: Store in ChromaDB

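ChromaDB is a local vector database: no server needed, just Python. The collection below is given an OpenAI embedding function, so Chroma embeds documents and queries itself using the same text-embedding-3-small model. create_embeddings from Step 5 shows what that call does, but the pipeline does not invoke it.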

    import chromadb
    from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

    def setup_chroma():
        """Initialize ChromaDB with OpenAI embeddings"""
        chroma_client = chromadb.PersistentClient(path="./chroma_db")

        embedding_fn = OpenAIEmbeddingFunction(
            api_key=os.getenv("OPENAI_API_KEY"),
            model_name="text-embedding-3-small"
        )

        collection = chroma_client.get_or_create_collection(
            name="documents",
            embedding_function=embedding_fn,
            metadata={"hnsw:space": "cosine"}
        )
        return collection

    def index_chunks(collection, chunks):
        """Store chunks in ChromaDB"""
        texts = [c["text"] for c in chunks]
        ids = [f"chunk_{i}" for i in range(len(chunks))]
        metadatas = [{"source": c["source"], "index": c["chunk_index"]} for c in chunks]

        # Add in batches
        batch_size = 100
        for i in range(0, len(texts), batch_size):
            collection.add(
                ids=ids[i:i+batch_size],
                documents=texts[i:i+batch_size],
                metadatas=metadatas[i:i+batch_size]
            )
            print(f"Indexed batch {i//batch_size + 1}")

        print(f"✅ Indexed {len(chunks)} chunks")

💡 Persistence: ChromaDB saves to ./chroma_db. Next time you run the script, it loads existing data. Delete the folder to re-index from scratch.
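Deleting the folder re-embeds every chunk. A lighter alternative, sketched here under assumptions, is to derive each chunk's ID from its content so that unchanged chunks can be recognized and skipped. The helpers `chunk_id` and `select_new_chunks` are hypothetical names for this example, and the hashing scheme is one option among many.

```python
import hashlib

def chunk_id(source, text):
    """Stable ID derived from the source path and chunk text (hypothetical scheme)."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"chunk_{digest[:16]}"

def select_new_chunks(chunks, existing_ids):
    """Return only chunks whose content-derived ID is not already stored."""
    seen = set(existing_ids)
    new_chunks = []
    for chunk in chunks:
        cid = chunk_id(chunk["source"], chunk["text"])
        if cid not in seen:
            seen.add(cid)  # Also de-duplicates identical chunks within this batch
            new_chunks.append({**chunk, "id": cid})
    return new_chunks
```

The existing IDs could come from `collection.get()["ids"]`, and the selected chunks would then be added with their `"id"` values in place of the positional `chunk_{i}` IDs. This sketch only detects additions; chunks from deleted or edited documents would still need to be removed separately.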

Step 7: Retrieve & Query

Now the fun part — asking questions:

    def query_rag(collection, question, n_results=3):
        """Retrieve relevant chunks and generate answer"""
        # 1. Retrieve relevant chunks
        results = collection.query(
            query_texts=[question],
            n_results=n_results
        )

        contexts = []
        for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
            contexts.append(f"[Source: {meta['source']}]\n{doc}")

        context_text = "\n\n---\n\n".join(contexts)

        # 2. Generate answer with context
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Answer based ONLY on the provided context. If the answer isn't in the context, say 'I don't have enough information.' Cite sources."},
                {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"}
            ]
        )

        return {
            "answer": response.choices[0].message.content,
            "sources": [m["source"] for m in results["metadatas"][0]],
            "contexts": contexts
        }
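A vector query always returns the nearest chunks, even when none of them is actually relevant. One hedged refinement is to drop hits whose distance is too large before building the prompt. The sketch below assumes the result dictionary also carries a `"distances"` list alongside `"documents"` and `"metadatas"`, as Chroma query results do by default, and that smaller distances mean closer matches. The function name and the 0.6 cutoff are hypothetical and need tuning on your own data.

```python
def filter_by_distance(results, max_distance=0.6):
    """Keep hits whose distance is at or below max_distance.

    `results` is assumed to have the nested-list shape of a query result
    for a single question. max_distance=0.6 is a hypothetical starting point.
    """
    kept = []
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        if dist <= max_distance:
            kept.append({"text": doc, "source": meta["source"], "distance": dist})
    return kept

# Hand-written stand-in for a query result (no database call is made here).
fake_results = {
    "documents": [["RAG combines retrieval with generation.", "Bananas are yellow."]],
    "metadatas": [[{"source": "rag-overview.txt"}, {"source": "fruit.txt"}]],
    "distances": [[0.25, 0.91]],
}
print(filter_by_distance(fake_results))
```

If the filter leaves nothing, the system can answer "I don't have enough information" without calling the chat model at all.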

Step 8: Build the RAG Pipeline

Wire everything together in rag.py:

    def main():
        print("📚 RAG System\n")

        # 1. Load
        print("1. Loading documents...")
        docs = load_documents()

        # 2. Chunk
        print("2. Chunking...")
        chunks = chunk_documents(docs)
        print(f"   Created {len(chunks)} chunks")

        # 3. Setup DB
        print("3. Setting up ChromaDB...")
        collection = setup_chroma()

        # 4. Index (skip if already indexed)
        if collection.count() == 0:
            print("4. Indexing chunks...")
            index_chunks(collection, chunks)
        else:
            print(f"4. Using existing index ({collection.count()} chunks)")

        # 5. Query loop
        print("\n✅ Ready! Ask questions (or type 'quit'):\n")
        while True:
            question = input("Q: ").strip()
            if not question:
                continue  # Ignore empty input instead of querying with it
            if question.lower() in ["quit", "exit"]:
                break

            result = query_rag(collection, question)
            print(f"\n🤖 {result['answer']}\n")
            print(f"📎 Sources: {', '.join(result['sources'])}\n")

    if __name__ == "__main__":
        main()

Run it:

python rag.py

Try these questions:

    Q: What is RAG?
    Q: What are the challenges in AI video generation?
    Q: What is the market size for AI video?

Step 9: Evaluate Quality

RAG systems fail silently. Add evaluation:

    def evaluate_answer(question, answer, contexts):
        """Simple grounding check: does the answer reuse words from a context?"""
        answer_words = set(answer.lower().split())
        if not answer_words:
            return "check needed"  # Nothing to compare (also avoids dividing by zero)

        for ctx in contexts:
            ctx_words = set(ctx.lower().split())
            # Fraction of the answer's words that also occur in this context
            overlap = len(ctx_words & answer_words) / len(answer_words)
            if overlap > 0.3:
                return "good"
        return "check needed"

💡 For production: Use an LLM-as-judge pattern. Ask GPT-4 to rate answer relevance 1–5. Log failures and improve chunking/retrieval.
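A minimal sketch of the LLM-as-judge pattern follows. It only builds the grading prompt and parses the reply, so it runs without an API call. The prompt wording, the function names, and the choice to take the first standalone digit from 1 to 5 are all assumptions for illustration.

```python
import re

def build_judge_messages(question, answer, contexts):
    """Assemble chat messages asking a model to grade an answer from 1 to 5."""
    context_text = "\n\n---\n\n".join(contexts)
    return [
        {"role": "system", "content": (
            "You grade answers produced by a retrieval system. "
            "Reply with a single integer from 1 (irrelevant or unsupported) "
            "to 5 (fully relevant and supported by the context)."
        )},
        {"role": "user", "content": (
            f"Context:\n{context_text}\n\n"
            f"Question: {question}\n\nAnswer: {answer}\n\nScore:"
        )},
    ]

def parse_score(reply):
    """Extract the first standalone digit 1-5 from the judge's reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

The messages can be passed to `client.chat.completions.create(...)` in the same way as in Step 7, and `parse_score` applied to the returned message content. Treating an unparseable reply (`None`) as a failure to log is safer than guessing a score.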

Step 10: Production Tips

Hybrid search — Combine vector similarity + keyword matching (BM25) for better results
Re-ranking — Retrieve 20 chunks, then use a cross-encoder to pick the top 5
Query rewriting — Expand "it" and "that" in follow-up questions using conversation history
Metadata filtering — Filter by date, author, or document type before vector search
Incremental updates — Only re-index changed documents, not everything
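As an example of the first tip, hybrid search needs a way to merge the vector ranking with the keyword ranking. Reciprocal rank fusion is one commonly used option, sketched below. It only needs each retriever's ordered list of chunk IDs, not comparable scores. The chunk IDs are hypothetical, and `k=60` is a frequently quoted default that is treated here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1 for the top hit.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

vector_hits = ["chunk_3", "chunk_1", "chunk_7"]   # hypothetical vector-search order
keyword_hits = ["chunk_1", "chunk_9", "chunk_3"]  # hypothetical BM25 order
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Chunks that appear high in both lists rise to the top, while a chunk found by only one retriever still stays in the candidate set for a later re-ranking step.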

⚠️ Common failure mode: If your chunks are too long, the LLM tends to overlook material in the middle of the context. If they are too short, the surrounding context is lost. Test with your actual documents and adjust chunk_size.

🚀 Next Steps

Add a web interface — Wrap in Streamlit or Gradio for a chat UI
Connect to APIs — Ingest from Notion, Confluence, or Google Drive
Multi-modal RAG — Add image understanding with CLIP embeddings
Agentic RAG — Combine with our AI Agent tutorial for self-correcting retrieval
