
Build a Production RAG System

โฑ๏ธ 60 minutes ๐Ÿ“‹ 10 steps ๐Ÿ’ป Python 3.10+ ๐Ÿ”„ Last updated: June 2026

Most RAG tutorials use LangChain and hide the complexity. We'll build it from scratch so you understand every piece: document loading, chunking, embeddings, vector search, and prompt augmentation. By the end, you'll have a system that can answer questions from your own documents.

Table of Contents

What You'll Build
Prerequisites
Step 1: How RAG Works
Step 2: Project Setup
Step 3: Load Documents
Step 4: Chunk Documents
Step 5: Create Embeddings
Step 6: Store in ChromaDB
Step 7: Retrieve & Query
Step 8: Build the RAG Pipeline
Step 9: Evaluate Quality
Step 10: Production Tips
Next Steps

🎯 What You'll Build

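A command-line RAG system that loads your documents, chunks and embeds them, stores the vectors in ChromaDB, and answers questions with cited sources.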

📋 Prerequisites

Step 1: How RAG Works

Before coding, understand the pipeline:

  1. Ingest → Load documents from files
  2. Chunk → Split into paragraphs/sentences
  3. Embed → Convert chunks to vectors (embeddings)
  4. Store → Save vectors in a vector database
  5. Query → Convert question to vector, find nearest neighbors
  6. Generate → Feed relevant chunks + question to LLM

💡 The magic: Embeddings capture semantic meaning. "Car" and "automobile" have similar vectors even though the words are different. This lets us find relevant documents even without keyword matching.
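
To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made-up toy values chosen for illustration, not real embeddings (real ones have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors invented for illustration only (not model output)
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

car_vs_automobile = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
car_vs_banana = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])
print(f"car vs automobile: {car_vs_automobile:.2f}")
print(f"car vs banana:     {car_vs_banana:.2f}")
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they are unrelated. Step 5 of the pipeline amounts to finding the stored chunks whose vectors score highest against the question's vector.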

Step 2: Project Setup

mkdir rag-tutorial && cd rag-tutorial
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install openai chromadb python-dotenv pypdf
touch rag.py .env
mkdir documents

Add your API key to .env:

OPENAI_API_KEY=sk-your-key-here

Add sample documents to the documents/ folder. If you don't have any, create a sample:

# Single quotes stop the shell from expanding "$1" in "$1.8B"
echo 'AI video generation has advanced rapidly. Models like Sora, Runway Gen-3, and Kling can create realistic video from text prompts. Key challenges include temporal consistency, prompt adherence, and compute cost. The market is expected to reach $1.8B by 2027.' > documents/ai-video.txt
echo 'Retrieval-Augmented Generation (RAG) combines information retrieval with text generation. Instead of relying on parametric knowledge, RAG systems fetch relevant documents and include them in the prompt. This reduces hallucinations and enables knowledge cutoff extension.' > documents/rag-overview.txt

Step 3: Load Documents

Open rag.py and add document loading:

import os
from pathlib import Path

def load_documents(directory="documents"):
    """Load all .txt, .md, and .pdf files from a directory"""
    docs = []
    for filepath in Path(directory).glob("*"):
        if filepath.suffix in [".txt", ".md"]:
            with open(filepath, "r", encoding="utf-8") as f:
                docs.append({
                    "content": f.read(),
                    "source": str(filepath),
                    "type": filepath.suffix
                })
        elif filepath.suffix == ".pdf":
            # Requires: pip install pypdf
            from pypdf import PdfReader
            reader = PdfReader(str(filepath))
            # extract_text() can return None for image-only pages, hence the "or"
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
            docs.append({"content": text, "source": str(filepath), "type": ".pdf"})
    return docs

# Test
if __name__ == "__main__":
    docs = load_documents()
    print(f"Loaded {len(docs)} documents")
    for d in docs:
        print(f"  - {d['source']}: {len(d['content'])} chars")

💡 For production: Add error handling, support for Word docs, web pages (via requests + BeautifulSoup), and database connectors.

Step 4: Chunk Documents

Chunking has an outsized effect on RAG quality. If chunks are too big, irrelevant text dilutes the answer. If they are too small, each chunk loses the context needed to make sense of it.

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks (sizes are in characters)"""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Try to break at a paragraph, sentence, or word boundary
        if end < len(text):
            for breaker in ["\n\n", ". ", " "]:
                idx = chunk.rfind(breaker)
                if idx > chunk_size * 0.7:  # Only break if we're past 70% of the chunk
                    chunk = chunk[:idx + len(breaker)]
                    end = start + len(chunk)
                    break

        if chunk.strip():  # Skip whitespace-only chunks
            chunks.append(chunk.strip())
        if end >= len(text):  # Last chunk reached; stop instead of emitting a duplicate tail
            break
        start = max(end - overlap, start + 1)  # Always move forward
    return chunks

# Chunk all documents
def chunk_documents(docs):
    """Chunk all documents into pieces with metadata"""
    chunks = []
    for doc in docs:
        text_chunks = chunk_text(doc["content"])
        for i, chunk in enumerate(text_chunks):
            chunks.append({
                "text": chunk,
                "source": doc["source"],
                "chunk_index": i,
                "total_chunks": len(text_chunks)
            })
    return chunks

💡 Chunk size rules of thumb:
• 200–400 tokens for precise Q&A
• 500–1000 tokens for summaries and broad questions
• 20–50% overlap means a sentence cut at one chunk boundary still appears whole in the neighboring chunk
Note that chunk_text above counts characters, not tokens, and its defaults (500 characters, 50 overlap) are on the small side of these ranges. English text averages roughly 4 characters per token, so scale chunk_size and overlap accordingly.
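
Since the rules of thumb are stated in tokens while chunk_text works in characters, a small conversion helper can bridge the two. This is a rough sketch: the helper name chunk_params_for_tokens is hypothetical, and the 4-characters-per-token ratio is an assumption that holds only approximately for English text. For exact counts you would use a real tokenizer.

```python
CHARS_PER_TOKEN = 4  # Assumption: rough average for English text

def chunk_params_for_tokens(target_tokens, overlap_ratio=0.2):
    """Translate a token budget into character-based chunk_size and overlap."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    chunk_size = target_tokens * CHARS_PER_TOKEN
    overlap = int(chunk_size * overlap_ratio)
    return {"chunk_size": chunk_size, "overlap": overlap}

qa_params = chunk_params_for_tokens(300)            # middle of the precise Q&A range
summary_params = chunk_params_for_tokens(800, 0.2)  # middle of the broad-question range
print(qa_params)
print(summary_params)
```

The resulting dictionaries can be passed straight through, for example chunk_text(text, **qa_params).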

Step 5: Create Embeddings

Embeddings convert text into vectors. We'll use OpenAI's text-embedding-3-small, which is fast, cheap, and high quality.

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def create_embeddings(texts, batch_size=100):
    """Create embeddings for a list of texts"""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        # response.data holds one item per input text, in the same order
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
    return all_embeddings

💡 Cost: text-embedding-3-small costs ~$0.02 per 1M tokens. Assuming about 500 tokens per page, embedding 100 pages costs a fraction of a cent and 1,000 pages costs about one cent. You would need on the order of 100,000 pages before the bill reaches $1–2.
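
The cost arithmetic is easy to check yourself. The sketch below uses the quoted price and assumes roughly 500 tokens per page, which is a guess that you should replace with a measurement from your own documents.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the figure quoted in the tip
TOKENS_PER_PAGE = 500            # Assumption: adjust for your documents

def embedding_cost_usd(pages, tokens_per_page=TOKENS_PER_PAGE):
    """Estimate the one-time cost of embedding a number of pages."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

for pages in (100, 1_000, 100_000):
    print(f"{pages:>7,} pages -> ${embedding_cost_usd(pages):.4f}")
```

Overlap between chunks means some text is embedded twice, so the real figure will be somewhat higher than this estimate.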

Step 6: Store in ChromaDB

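ChromaDB is a local vector database: no server needed, just Python. Because the collection is created with OpenAIEmbeddingFunction, Chroma calls the embeddings API itself whenever you add or query text, so you don't need to call create_embeddings from Step 5 by hand.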

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

def setup_chroma():
    """Initialize ChromaDB with OpenAI embeddings"""
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    embedding_fn = OpenAIEmbeddingFunction(
        api_key=os.getenv("OPENAI_API_KEY"),
        model_name="text-embedding-3-small"
    )
    collection = chroma_client.get_or_create_collection(
        name="documents",
        embedding_function=embedding_fn,
        metadata={"hnsw:space": "cosine"}
    )
    return collection

def index_chunks(collection, chunks):
    """Store chunks in ChromaDB"""
    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    metadatas = [{"source": c["source"], "index": c["chunk_index"]} for c in chunks]

    # Add in batches
    batch_size = 100
    for i in range(0, len(texts), batch_size):
        collection.add(
            ids=ids[i:i+batch_size],
            documents=texts[i:i+batch_size],
            metadatas=metadatas[i:i+batch_size]
        )
        print(f"Indexed batch {i//batch_size + 1}")
    print(f"✅ Indexed {len(chunks)} chunks")

💡 Persistence: ChromaDB saves to ./chroma_db. Next time you run the script, it loads existing data. Delete the folder to re-index from scratch.
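
The positional IDs in index_chunks (chunk_0, chunk_1, ...) work for a one-off index, but they shift whenever documents are added or edited. One possible alternative, sketched here as an assumption rather than as part of the tutorial's code, is to derive each ID from the chunk's source, position, and content. The helper name stable_chunk_id is hypothetical.

```python
import hashlib

def stable_chunk_id(source, chunk_index, text):
    """Build an ID that changes only when the chunk's origin or content changes."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{source}:{chunk_index}:{digest}"

id_a = stable_chunk_id("documents/rag-overview.txt", 0, "RAG combines retrieval with generation.")
id_b = stable_chunk_id("documents/rag-overview.txt", 0, "RAG combines retrieval with generation.")
id_c = stable_chunk_id("documents/rag-overview.txt", 0, "RAG combines retrieval with generation!")
print(id_a)
print(id_a == id_b, id_a == id_c)
```

With IDs like these, Chroma's collection.upsert, which takes the same arguments as collection.add, could refresh changed chunks without deleting the folder. Chunks that no longer exist in the source files would still need to be removed separately.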

Step 7: Retrieve & Query

Now the fun part, asking questions:

def query_rag(collection, question, n_results=3):
    """Retrieve relevant chunks and generate answer"""
    # 1. Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=n_results
    )

    # Results are nested one list per query; we sent one query, so take index 0
    contexts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        contexts.append(f"[Source: {meta['source']}]\n{doc}")
    context_text = "\n\n---\n\n".join(contexts)

    # 2. Generate answer with context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You are a helpful assistant. Answer based ONLY on the provided context. "
                "If the answer isn't in the context, say 'I don't have enough information.' "
                "Cite sources."
            )},
            {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"}
        ]
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [m["source"] for m in results["metadatas"][0]],
        "contexts": contexts
    }
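
query_rag always passes the top n_results chunks to the model, even when none of them is actually close to the question. Chroma's query results also include a "distances" list by default, and with the cosine space configured in Step 6 a smaller distance means a closer match. The sketch below filters on that value. The helper name filter_by_distance and the 0.6 cutoff are assumptions for illustration, so tune the threshold on your own data.

```python
def filter_by_distance(results, max_distance=0.6):
    """Keep only hits whose distance is at or below max_distance.

    `results` is expected in the shape collection.query() returns: parallel
    lists under "documents", "metadatas", and "distances", one inner list per query.
    """
    kept = []
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        if dist <= max_distance:
            kept.append({"text": doc, "source": meta["source"], "distance": dist})
    return kept

# Hand-written stand-in for a real query result (values are made up)
fake_results = {
    "documents": [["RAG fetches relevant documents.", "Kling creates video."]],
    "metadatas": [[{"source": "documents/rag-overview.txt"},
                   {"source": "documents/ai-video.txt"}]],
    "distances": [[0.25, 0.85]],
}
hits = filter_by_distance(fake_results)
print(hits)
```

If the filtered list comes back empty, the pipeline can answer "I don't have enough information" directly and skip the LLM call.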

Step 8: Build the RAG Pipeline

Wire everything together in rag.py, replacing the test block from Step 3:

def main():
    print("📚 RAG System\n")

    # 1. Load
    print("1. Loading documents...")
    docs = load_documents()

    # 2. Chunk
    print("2. Chunking...")
    chunks = chunk_documents(docs)
    print(f"   Created {len(chunks)} chunks")

    # 3. Setup DB
    print("3. Setting up ChromaDB...")
    collection = setup_chroma()

    # 4. Index (skip if already indexed)
    if collection.count() == 0:
        print("4. Indexing chunks...")
        index_chunks(collection, chunks)
    else:
        print(f"4. Using existing index ({collection.count()} chunks)")

    # 5. Query loop
    print("\n✅ Ready! Ask questions (or type 'quit'):\n")
    while True:
        question = input("Q: ").strip()
        if question.lower() in ["quit", "exit"]:
            break
        result = query_rag(collection, question)
        print(f"\n🤖 {result['answer']}\n")
        print(f"📎 Sources: {', '.join(result['sources'])}\n")

if __name__ == "__main__":
    main()

Run it:

python rag.py

Try these questions:

Q: What is RAG?
Q: What are the challenges in AI video generation?
Q: What is the market size for AI video?

Step 9: Evaluate Quality

RAG systems fail silently. Add evaluation:

def evaluate_answer(question, answer, contexts):
    """Simple groundedness check based on word overlap"""
    answer_words = set(answer.lower().split())
    if not answer_words:  # Avoid dividing by zero on an empty answer
        return "check needed"
    for ctx in contexts:
        ctx_words = set(ctx.lower().split())
        # Fraction of the answer's words that also appear in this context.
        # Dividing by the context size instead would almost never pass,
        # because contexts are much longer than answers.
        overlap = len(ctx_words & answer_words) / len(answer_words)
        if overlap > 0.3:
            return "good"
    return "check needed"

💡 For production: Use an LLM-as-judge pattern. Ask GPT-4 to rate answer relevance 1–5. Log failures and improve chunking/retrieval.
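
The LLM-as-judge idea can be sketched in two small pieces: building the grading prompt and parsing the score out of the reply. Both function names and the prompt wording are illustrative assumptions. The model call itself is left out and would go through client.chat.completions.create as in Step 7.

```python
import re

def build_judge_prompt(question, answer, contexts):
    """Assemble a grading prompt for a judge model (wording is illustrative)."""
    context_text = "\n\n".join(contexts)
    return (
        "Rate how well the ANSWER addresses the QUESTION using only the CONTEXT.\n"
        "Reply with a single integer from 1 (irrelevant) to 5 (fully supported).\n\n"
        f"CONTEXT:\n{context_text}\n\nQUESTION: {question}\n\nANSWER: {answer}"
    )

def parse_score(judge_reply):
    """Extract the first standalone digit 1-5 from the reply, or None."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt(
    "What is RAG?",
    "RAG combines retrieval with generation.",
    ["RAG combines information retrieval with text generation."],
)
print(prompt)
print(parse_score("Score: 4"))
print(parse_score("no rating given"))
```

Returning None for an unparseable reply keeps a malformed judge response from being logged as a low score.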

Step 10: Production Tips

โš ๏ธ Common failure mode: If your chunks are too long, the LLM ignores the middle. If too short, context is lost. Test with your actual documents and adjust chunk_size.
โ† All Tutorials Build an AI Agent โ†’