The context window is the model's working memory: every token it can see at once. Longer windows enable longer conversations and documents, but attention compute grows quadratically with sequence length and KV cache memory grows linearly. The KV cache makes multi-turn conversation practical. RAG makes it practical to work with document collections far larger than the window.
Course: Moderate.
This lesson covers 5 concepts: What the Context Window Is, What Fills the Window, The KV Cache, The Cost of Long Contexts, Working Within Limits.
The context window is every token the model can currently see — system prompt, conversation history, documents, your message, and its own output so far. It is the model's complete working memory.
When the window fills, the model loses access to the oldest content. It has no memory beyond the window — not faded memory, but complete absence.
Think of it like a whiteboard — the model can see and use everything on the whiteboard, but once it fills up, the oldest writing must be erased to make room for new content.
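The whiteboard behaviour can be sketched as a simple truncation routine. This is a minimal illustration, not how any particular product manages its history: `fit_to_window` and `rough_token_count` are hypothetical names, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
def fit_to_window(messages, max_tokens, count_tokens):
    """Keep the newest messages that fit in max_tokens; drop the oldest."""
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this point is "erased"
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def rough_token_count(text):
    # Assumption: one whitespace-separated word stands in for one token.
    return len(text.split())


history = ["first turn here", "second turn", "third and latest turn"]
visible = fit_to_window(history, max_tokens=6, count_tokens=rough_token_count)
# visible == ["second turn", "third and latest turn"]
```

The oldest turn is gone entirely, not summarised or faded, which matches the "complete absence" described above.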
GPT-4 Turbo: 128K tokens ≈ 96,000 words ≈ a 340-page book. Claude 3.5 Sonnet: 200K ≈ 150,000 words. Gemini 1.5 Pro: 1M ≈ 750,000 words ≈ 10 novels at once. These estimates assume roughly 0.75 English words per token.
A typical context window is shared between: system prompt (500 tokens), conversation history (2,000 tokens), an attached document (8,000 tokens), and the model's response (1,200 tokens) — 11,700 tokens total.
Context budgeting is a design discipline — how you allocate tokens between system instructions, history, documents, and response space directly determines what the model can and cannot do.
Think of it like a fixed-size meeting agenda. The system prompt books the first 500 tokens. The conversation history gets 2,000. The document gets 8,000. The response has 1,200. That is the budget.
A customer support bot with a 500-token system prompt, 2K of conversation history, and 8K of product docs uses 10.5K tokens. That leaves 117.5K in a 128K GPT-4 deployment, but only 13.5K in a model with a 24K window.
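The budgeting arithmetic above can be checked with a few lines of Python. This is a sketch under stated assumptions: `remaining_budget` is a hypothetical helper, 128K is read as 128,000 tokens, and the 24,000-token window is inferred from the 13.5K figure rather than taken from a specific model.

```python
def remaining_budget(window_size, allocations):
    """Return the tokens left after subtracting every allocation."""
    used = sum(allocations.values())
    if used > window_size:
        raise ValueError(f"budget exceeded: {used} tokens used, {window_size} available")
    return window_size - used


# Allocations from the customer support example.
support_bot = {"system_prompt": 500, "history": 2_000, "product_docs": 8_000}

left_large = remaining_budget(128_000, support_bot)  # 117_500
left_small = remaining_budget(24_000, support_bot)   # 13_500
```

Whatever remains has to cover the model's response as well as any growth in history, so the smaller window leaves much less room to manoeuvre.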
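To generate each new token, the model must attend to all previous tokens. The KV cache stores the key and value projections for all prior tokens so they do not need to be recomputed: each step computes the query, key, and value for the new token only, appends the new key and value to the cache, and attends over the cached entries.
The KV cache is what makes conversational AI practical. Without it, every new token would require recomputing keys and values for the entire preceding sequence, so generation would slow sharply as the context grows and become unusable at scale.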
Like a running meeting transcript. Instead of re-reading all previous minutes before each new statement, the secretary just appends to the existing record. The KV cache is that record.
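A toy single-head attention layer makes the "append to the record" idea concrete. The sketch below uses plain Python lists and random weights, so it is far simpler than a real transformer; `ToyKVCache` and the other names are hypothetical. It shows that projecting only the new token and reusing cached keys and values gives the same output as recomputing everything for the whole sequence.

```python
import math
import random


def matvec(x, w):
    """Multiply row vector x (length d) by matrix w (d x d)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]


class ToyKVCache:
    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, x):
        # Only the new token is projected; earlier keys and values are reused.
        self.keys.append(matvec(x, self.w_k))
        self.values.append(matvec(x, self.w_v))
        return attend(matvec(x, self.w_q), self.keys, self.values)


def attend_without_cache(xs, w_q, w_k, w_v):
    # Recomputes keys and values for every token in the sequence.
    keys = [matvec(x, w_k) for x in xs]
    values = [matvec(x, w_v) for x in xs]
    return attend(matvec(xs[-1], w_q), keys, values)


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
w_q, w_k, w_v = random_matrix(d, d), random_matrix(d, d), random_matrix(d, d)
tokens = random_matrix(5, d)  # five made-up token embeddings

cache = ToyKVCache(w_q, w_k, w_v)
cached_outputs = [cache.step(x) for x in tokens]
full_output = attend_without_cache(tokens, w_q, w_k, w_v)
matches = all(math.isclose(a, b, abs_tol=1e-12)
              for a, b in zip(cached_outputs[-1], full_output))
```

With the cache, each step performs three projections for one token. Without it, the number of projections grows with the length of the sequence at every step.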
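70B-class model at 128K context (illustrative figures): KV cache ≈ 2 × n × d × L × bytes, where the 2 covers keys and values, n is the number of tokens, d is the key/value width per layer, L is the number of layers, and bytes is the size of each stored value. With n = 131,072, d = 4096, L = 80, and 2 bytes per value (fp16): 2 × 131,072 × 4096 × 80 × 2 bytes ≈ 172 GB (160 GiB), larger than the roughly 140 GB of fp16 weights for a 70B-parameter model.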
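The estimate above is easy to reproduce. In this sketch `kv_cache_bytes` is a hypothetical helper, 128K is read as 131,072 tokens, and the width and layer count are the illustrative values from the example rather than the configuration of a specific model.

```python
def kv_cache_bytes(n_tokens, kv_width, n_layers, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token in every layer."""
    # Leading 2: one key vector and one value vector per token per layer.
    return 2 * n_tokens * kv_width * n_layers * bytes_per_value


size = kv_cache_bytes(n_tokens=128 * 1024, kv_width=4096, n_layers=80)
size_gib = size / 2**30  # 160.0

# Memory grows linearly: 16x the tokens needs 16x the cache.
ratio = size / kv_cache_bytes(n_tokens=8 * 1024, kv_width=4096, n_layers=80)  # 16.0
```

Because the formula is a plain product, halving any one factor (fewer tokens, a narrower key/value width, or lower-precision values) halves the cache.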
At per-token pricing, a 128K context costs roughly 16× more than an 8K context to process, simply because it contains 16× as many tokens. The pricing is not arbitrary: attention computation grows quadratically with sequence length and KV cache memory grows linearly.
Context length is not free. Building applications that use 128K+ context windows for every query will have dramatically higher inference costs than applications that manage context efficiently.
Every token in the context window has to be processed. More tokens means more computation, more memory, more cost. There is no free lunch with context length.
8K context: $0.08 per query at GPT-4 rates (about $0.01 per 1K input tokens). 128K context: $1.28 per query, 16× higher. At 1M queries per month: $80K vs $1.28M. Context management is a cost management problem.
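The cost comparison can be reproduced with a short calculation. This is a sketch: `query_cost` is a hypothetical helper, the rate of $0.01 per 1K input tokens is inferred from the $0.08 figure for an 8K context, and output tokens are ignored.

```python
def query_cost(context_tokens, price_per_1k_tokens):
    """Input cost in dollars for one query at a flat per-token price."""
    return context_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.01          # assumed input rate in dollars
QUERIES_PER_MONTH = 1_000_000

small = query_cost(8_000, PRICE_PER_1K)     # about $0.08 per query
large = query_cost(128_000, PRICE_PER_1K)   # about $1.28 per query

monthly_small = small * QUERIES_PER_MONTH   # about $80K
monthly_large = large * QUERIES_PER_MONTH   # about $1.28M
```

The per-query difference looks small; the monthly totals show why trimming unused context is worth engineering effort.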
Four practical strategies for context management: RAG retrieves only relevant chunks (unlimited document scale), chunking splits long documents into processable pieces, summarisation compresses old conversation turns, and prompt compression removes redundancy.
RAG + chunking handle the majority of real-world cases where documents exceed context limits. Summarisation and compression handle the rest. Together, they make context limits a manageable engineering constraint, not a hard ceiling.
Context limits are a manageable constraint, not a fundamental barrier. Good engineering around context means you can effectively work with documents and conversations of any length.
A 500-page legal document at 250K tokens exceeds a 128K or 200K context window. Chunking + RAG retrieves the 3 most relevant clauses (2K tokens) and answers from those alone. Loading the full document would be impossible in such a window and unnecessary for the question.
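The chunk-then-retrieve flow in this example can be sketched in a few lines. This is a deliberately simplified stand-in: production RAG systems usually rank chunks by embedding similarity, whereas this sketch scores chunks by shared words. The names `chunk_words` and `retrieve`, and the sample contract text, are made up for illustration.

```python
import re


def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of chunk_size words, optionally overlapping."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def terms(text):
    """Lowercased set of alphanumeric words, used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=3):
    """Return up to top_k chunks sharing the most words with the query."""
    query_terms = terms(query)
    scored = [(len(query_terms & terms(chunk)), index)
              for index, chunk in enumerate(chunks)]
    # Highest overlap first; ties keep document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[index] for score, index in scored[:top_k] if score > 0]


contract = (
    "The tenant shall pay rent monthly. "
    "Either party may terminate this agreement with ninety days notice. "
    "The landlord is responsible for structural repairs."
)
chunks = chunk_words(contract, chunk_size=8)
best = retrieve("How much notice to terminate the agreement?", chunks, top_k=1)
# best == ["may terminate this agreement with ninety days notice."]
```

Only the retrieved chunks are placed in the context window, so the token cost depends on `top_k` and the chunk size rather than on the length of the source document.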