Building a RAG System


RAG looks simple but the details determine whether it works in production. Chunking strategy affects recall. Embedding model choice affects semantic match quality. Re-ranking separates relevant from retrieved. And RAGAS gives you four metrics to know if your pipeline is actually working.

Course: Advanced.

This lesson covers 5 concepts: Chunking Strategy, Embedding Model Choice, Retrieval Pipeline, Re-ranking the Results, Evaluating with RAGAS.

Chunking Strategy

Chunking splits documents into retrievable units. Fixed-size chunking (512 tokens, 50-token overlap) is the reliable default. Semantic chunking splits on meaning boundaries. Hierarchical chunking stores small chunks for retrieval but injects their parent sections for generation.
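A minimal sketch of the fixed-size default, assuming the document has already been tokenized into a list. The function name chunk_fixed and the synthetic token list are illustrative, not part of any library.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption).
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_fixed(tokens)
print([len(c) for c in chunks])
```

The overlap means the last 50 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk as long as it is shorter than the overlap.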

Wrong chunk size is one of the most common RAG failures. A 128-token chunk often loses the surrounding context. A 2,048-token chunk buries the answer in noise and wastes context budget.

Chunk size is a tradeoff: small chunks are easier to find but lack surrounding context. Large chunks have full context but the relevant sentence is harder to surface.

50-page legal document: semantic chunking on clause boundaries → 180 chunks of varying size → each clause is a complete, retrievable unit. Fixed-size would split clauses mid-sentence.
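Splitting on clause boundaries can be sketched with a regular expression, assuming each clause starts on its own line with a number such as "1." or "2.3". That numbering convention, the CLAUSE_START pattern and the sample contract are all assumptions for illustration; real documents usually need a format-specific rule.

```python
import re

# A clause is assumed to start at a line beginning with "1.", "2.3", etc.
CLAUSE_START = re.compile(r"^(?=\d+(?:\.\d+)*\.?\s)", re.MULTILINE)


def chunk_by_clause(text):
    """Return one chunk per numbered clause, keeping wrapped lines together."""
    parts = CLAUSE_START.split(text)
    return [p.strip() for p in parts if p.strip()]


contract = (
    "1. Payment. The buyer pays within 30 days\n"
    "of the invoice date.\n"
    "2. Refunds. Digital products are refundable\n"
    "within 14 days of purchase.\n"
    "3. Governing law. This agreement is governed by local law.\n"
)
clauses = chunk_by_clause(contract)
for clause in clauses:
    print(repr(clause))
```

Chunks vary in length because they follow the document's structure, not a token budget. One caveat of this sketch: a wrapped line that happens to begin with a number would be treated as a new clause.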

Embedding Model Choice

Embedding model quality determines how well semantic search works. text-embedding-3-large (quality score 0.94) leads. e5-large-v2 (0.89) is free to use and nearly as good. bge-m3 (0.86) handles multilingual content. all-MiniLM (0.61) is fast but weaker.

Embedding model choice has a larger impact on RAG quality than retrieval algorithm choice. Upgrading from ada-002 to text-embedding-3-large typically improves retrieval precision by 10–15%.

The query and all document chunks must use the exact same embedding model. Mixing models is like using different measurement units — the distances mean nothing.
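One way to enforce the same-model rule is to record the model name on the index and reject queries embedded with anything else. The sketch below uses hand-written 3-dimensional vectors in place of real embeddings; VectorIndex, query_model and the model names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Toy in-memory index that remembers which model produced its vectors."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.items = []  # list of (chunk_text, vector)

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, query_model, top_k=3):
        # Vectors from different models live in unrelated spaces,
        # so comparing them would produce meaningless distances.
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, query used {query_model!r}"
            )
        scored = [(cosine(query_vector, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = VectorIndex("model-a")
index.add("refund policy", [0.9, 0.1, 0.0])
index.add("shipping times", [0.1, 0.9, 0.0])
print(index.search([1.0, 0.0, 0.0], query_model="model-a", top_k=1))
```

Switching embedding models therefore means re-embedding every stored chunk, not just the queries.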

MTEB benchmark (Muennighoff et al.): objective comparison of embedding models across 58 datasets spanning 8 task types, retrieval among them. text-embedding-3-large ranks top 5 overall. e5-large-v2 comes close to it with no API cost for English-only use cases.

Retrieval Pipeline

Production retrieval combines BM25 (keyword) and dense (semantic) search via Reciprocal Rank Fusion. Query expansion and HyDE (Hypothetical Document Embeddings) further improve recall. The goal: surface every relevant chunk, then let re-ranking sort them by quality.

BM25 finds exact terms dense search misses. Dense search finds semantic variants BM25 misses. Combining both is consistently better than either alone on diverse query distributions.
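Reciprocal Rank Fusion merges the two result lists using ranks only, so BM25 scores and cosine similarities never need to be put on a common scale. A minimal sketch, assuming each retriever returns chunk IDs in rank order; k=60 is the commonly used constant, and the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ranking = ["c1", "c2", "c3"]   # hypothetical keyword results
dense_ranking = ["c3", "c1", "c4"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that both retrievers rank highly (c1, c3) rise to the top, while chunks found by only one method still stay in the candidate set for re-ranking.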

No single retrieval method is perfect. Combining keyword and semantic search with query expansion catches what any single method would miss — improving recall by 20–30% over dense-only search.

Query: "refund" → BM25 finds exact-term matches such as "refund" and, with stemming, "refunds". Dense finds semantically related "reimbursement", "money back", "return policy". Hybrid catches both classes of relevant chunks.
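The keyword side of this example can be reproduced with a compact BM25 scorer. This is a sketch under simplifying assumptions: whitespace tokenization, no stemming, the common defaults k1=1.5 and b=0.75, and three made-up chunks.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    q_terms = query.lower().split()
    # Document frequency: in how many documents each query term appears.
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in q_terms:
            if tf[term] == 0:
                continue  # no exact match, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "refund requests are processed within five days",
    "we offer your money back if the product is faulty",
    "shipping takes three to five days",
]
scores = bm25_scores("refund", docs)
print(scores)
```

Only the first chunk receives a non-zero score. The "money back" chunk is relevant but shares no term with the query, which is the gap dense retrieval fills.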

Re-ranking the Results

Re-ranking applies a cross-encoder to the top-k retrieved chunks, jointly scoring each (query, chunk) pair. For the query "refund digital products", the refund policy chunk scores 0.94, a confident match. The shipping and returns chunk scores 0.12, so it is correctly deprioritised despite its surface similarity to the query.

Re-ranking typically improves precision by 15–25% over bi-encoder retrieval alone at the cost of k cross-encoder forward passes — worth it for the top-5 chunks that enter the generation context.
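The re-ranking step itself is a small loop: score every (query, chunk) pair, sort, keep the top few. In the sketch below the scorer is passed in as score_fn. The toy_score function is only a word-overlap stand-in so the example runs without a model; it is not a cross-encoder. In practice score_fn would wrap a real cross-encoder, for example the CrossEncoder class in the sentence-transformers library.

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Score each (query, chunk) pair and return the top_n as (score, chunk)."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_score(query, chunk):
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


candidates = [
    "shipping and returns for physical goods",
    "refund policy for digital products",
    "digital gift cards",
]
top = rerank("refund digital products", candidates, toy_score, top_n=2)
for score, chunk in top:
    print(round(score, 2), chunk)
```

The cost scales with the number of candidates, one scoring call per chunk, which is why re-ranking is applied to the retrieved top-k and not to the whole corpus.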

Retrieval is like a wide net — catches everything possibly relevant. Re-ranking is like sorting through the catch — keeps the fish, throws back the seaweed.

"Shipping and returns" retrieved with 0.68 cosine similarity to "refund digital products" — lexically similar but semantically different. Cross-encoder scores it 0.12 — correctly excluded from the final context.

Evaluating with RAGAS

RAGAS provides four complementary metrics: Faithfulness (are claims grounded in context?), Answer Relevance (does the response address the question?), Context Precision (are retrieved chunks useful?), Context Recall (were all necessary facts retrieved?). All four must be optimised together.
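To make the two retrieval-side metrics concrete, here is a simplified sketch that assumes relevance judgments are already available as 0/1 labels. RAGAS itself derives such judgments with an LLM, so these functions illustrate the idea and are not the library's API; the function names and sample labels are hypothetical.

```python
def context_precision(relevance):
    """Rank-aware precision over retrieved chunks.

    relevance: 0/1 flags for the retrieved chunks, in rank order.
    Averages precision@k over the ranks where a relevant chunk appears,
    so relevant chunks ranked early score higher than ones ranked late.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def context_recall(supported):
    """Share of reference-answer claims that the retrieved context supports.

    supported: 0/1 flags, one per claim in the reference answer.
    """
    return sum(supported) / len(supported) if supported else 0.0


print(context_precision([1, 0, 1, 0]))  # relevant chunks at ranks 1 and 3
print(context_recall([1, 1, 1, 0]))     # 3 of 4 claims found in the context
```

Precision penalises noise in what was retrieved; recall penalises facts that were never retrieved. A pipeline can score well on one and poorly on the other, which is why both are tracked.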

Without all four metrics, you optimise one dimension and degrade another silently. RAGAS is the minimum viable eval framework for any production RAG system.

Measuring only answer quality tells you the output is good but not why, or how to improve it. RAGAS tells you exactly which part of the pipeline — retrieval, re-ranking, or generation — is the problem.

Faithfulness 0.92, Answer Relevance 0.88, Context Precision 0.71, Context Recall 0.94: the bottleneck is Context Precision, meaning irrelevant chunks are reaching the context. Fix: improve re-ranking, or tighten chunking so each chunk stays on a single topic.
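The diagnosis above can be automated as a first-pass triage. The sketch treats the lowest score as the bottleneck and maps it to a pipeline stage; that mapping is a rule of thumb drawn from this lesson, and the names STAGE_HINTS and weakest_metric are hypothetical.

```python
# Rule-of-thumb mapping from each metric to the stage it most often implicates.
STAGE_HINTS = {
    "faithfulness": "generation: the answer makes claims the context does not support",
    "answer_relevance": "generation: the answer drifts away from the question",
    "context_precision": "retrieval or re-ranking: irrelevant chunks reach the context",
    "context_recall": "retrieval or chunking: needed facts are not being retrieved",
}


def weakest_metric(scores):
    """Return the lowest-scoring metric and a hint about where to look."""
    name = min(scores, key=scores.get)
    return name, STAGE_HINTS[name]


scores = {
    "faithfulness": 0.92,
    "answer_relevance": 0.88,
    "context_precision": 0.71,
    "context_recall": 0.94,
}
name, hint = weakest_metric(scores)
print(name, "->", hint)
```

Picking the single lowest score is a simplification; when two metrics are close, both stages deserve a look.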