RAG — Retrieval-Augmented Generation


A model's knowledge is frozen at training time. RAG gives it access to any document, database, or knowledge source at query time — by embedding documents into a vector store, retrieving the most relevant chunks when a question arrives, and injecting them into the prompt before generation.

Course: Moderate.

This lesson covers 5 concepts: The Problem RAG Solves, Step 1 — Chunk and Embed, Step 2 — Vector Store Search, Step 3 — Retrieve and Generate, When RAG Works — and Fails.

The Problem RAG Solves

A base model cannot answer questions about your company's specific policies, recent events, or proprietary data — it was never trained on them. RAG gives it access to any document at query time.

Without RAG, a company deploying AI must either fine-tune on its own data (expensive, slow) or accept that the model cannot answer company-specific questions. RAG is the efficient middle path.

The model is like a brilliant consultant who knows everything about the world — but nothing about your specific company. RAG is like handing them your company handbook before each meeting.

"What is our refund policy?" — the base model cannot know this. It was never trained on your policy document. RAG retrieves it and injects it into the context before the model answers.

Step 1 — Chunk and Embed

Documents are split into overlapping chunks (~500 tokens each), then each chunk is converted into an embedding vector. These vectors are stored in a vector database indexed for fast similarity search.

Chunk size is a critical hyperparameter: too small loses context; too large dilutes the semantic signal. Overlap between chunks prevents information loss at boundaries.

Like creating an index for a book — split it into topics, give each topic a location code, and store the codes in an index. When you need something, look up the code and go straight to the right page.

A 50-page product manual might be split into roughly 200 chunks of 500 tokens each with a 50-token overlap (the exact count depends on how dense the pages are). Each chunk is embedded, and the 200 resulting vectors are stored. At query time: embed the question → find the 3 closest vectors → retrieve those 3 chunks.
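The chunking step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `chunk_tokens` is hypothetical, and the "tokens" are placeholder strings, whereas a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder "document" of 1,400 tokens (an assumption for illustration).
tokens = [f"tok{i}" for i in range(1400)]
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Each window advances by chunk_size minus overlap (450 tokens here), so the last 50 tokens of one chunk reappear as the first 50 of the next. That repetition is what keeps a sentence straddling a boundary intact in at least one chunk.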

Step 2 — Vector Store Search

The question is embedded into a vector, then the vector store finds the chunks with the highest cosine similarity. "Refund policy: digital products, 14 days" scores 0.94 — clearly the right chunk.

The top-k retrieved chunks are what get injected into the prompt. The model never sees the full document — only the most relevant pieces.

Like a library search that understands meaning, not just keywords. "Refund policy for digital products" finds "Digital products may be refunded within 14 days" — even without identical words.

Query: "refund digital products" → embed → search → top result: "Refund policy: digital products, 14 days" (0.94). Privacy policy scores 0.09 — correctly excluded. Only relevant content reaches the model.
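The search step can be illustrated with a brute-force cosine-similarity ranking. This is a sketch under stated assumptions: the 3-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from an embedding model), and `top_k` is a hypothetical helper. Production vector stores typically use approximate nearest-neighbour indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical embeddings, invented for illustration only.
store = [
    ("Refund policy: digital products, 14 days", [0.9, 0.1, 0.0]),
    ("Privacy policy: data retention", [0.0, 0.1, 0.9]),
    ("Shipping times for physical goods", [0.4, 0.8, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded query "refund digital products"

for score, text in top_k(query_vec, store, k=2):
    print(f"{score:.2f}  {text}")
```

With these toy vectors the refund chunk ranks first and the privacy chunk falls outside the top 2, mirroring the example above. The scores themselves are artefacts of the invented vectors, not real model output.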

Step 3 — Retrieve and Generate

The retrieved chunks are injected into the prompt as context (in the system prompt or alongside the user's question, depending on the application). The model now has the specific information it needs and can generate a precise, grounded answer instead of a generic guess.

The answer is traceable: claims in the response can be checked against the retrieved chunks they came from. This verifiability is RAG's key advantage over fine-tuning.
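One way the injection step might look is sketched here. The function name `build_prompt`, the instruction wording, and the bracketed chunk numbering are all assumptions made for illustration. The example chunk texts paraphrase the refund policy used in this lesson, and the call to an actual model API is deliberately left out.

```python
def build_prompt(question, chunks):
    """Assemble a prompt that places numbered retrieved chunks before the question."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "Cite the chunk number for each claim. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Illustrative chunks, as if returned by the vector search step.
retrieved = [
    "Digital products may be refunded within 14 days.",
    "Partial refunds are available after 14 days if the product is unused.",
]
prompt = build_prompt("What is our refund policy?", retrieved)
print(prompt)  # this string would be sent to whichever model API is in use
```

Numbering the chunks and asking for citations is one simple way to make the traceability described above checkable: each cited number can be compared against the chunk it points to.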

Like handing someone the answer key before an exam. The model's job is now to extract the relevant answer from the provided context, not to recall it from memory.

"Digital products can be refunded within 14 days... partial refunds available after 14 days if unused." The answer is grounded in the retrieved policy document rather than guessed from memory. This sharply reduces hallucination, although the model can still misread the context or go beyond it, which is why traceability matters.

When RAG Works — and Fails

RAG excels at factual Q&A over specific documents and staying current without retraining. It struggles when questions require synthesising many documents simultaneously or when the answer cannot be retrieved because it is implied rather than stated.

Knowing when not to use RAG is as important as knowing when to use it. Retrieval for questions the model can already answer adds cost and latency for no benefit. Relying on RAG for synthesis-heavy questions leads to missed context, because only the top-k chunks ever reach the model.

RAG is a very good librarian — excellent at fetching the right book and finding the relevant page. But it is not an analyst — it cannot draw inferences across a hundred books simultaneously.

Good RAG: "What is our 14-day refund policy?" A single chunk holds the exact answer. Bad RAG: "What are the three most contradictory policies across our 50 departments?" This requires global synthesis across documents, which retrieving a handful of top-k chunks cannot provide.