Training a large language model is an engineering challenge as much as a research one. Fifteen trillion tokens of curated data, roughly 16,000 GPUs running for more than two months, a loss curve that starts at 6.4 and slowly grinds down to 1.4: at the end, capabilities emerge that nobody explicitly trained for.
Course: Moderate.
This lesson covers 5 concepts: Building the Training Corpus, The Loss Curve, The Optimizer Loop, The Compute Bill, What Emerges From Scale.
Training begins with data: raw web crawls are filtered, cleaned, tokenized, and batched into a multi-trillion-token corpus. Data quality sets the ceiling on what any model can learn.
Most capability differences between models come from data quality and mix, not architecture. Curation is where training actually begins.
Like preparing ingredients before cooking. The best recipe cannot fix bad ingredients — and no amount of training can fix a flawed corpus.
LLaMA 3 training data, as reported by Meta: roughly 50% general knowledge, 25% math and reasoning, 17% code, and 8% multilingual tokens. The specific mix directly shapes the model's strengths.
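The filter, clean, tokenize, and batch stages can be sketched in a few lines. This is a toy illustration, not any lab's actual pipeline: the thresholds `MIN_WORDS` and `MAX_SYMBOL_RATIO`, the whitespace tokenizer, and every function name below are hypothetical stand-ins for much richer production components.

```python
import random
import re

# Hypothetical quality thresholds; real pipelines use far richer heuristics.
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3


def clean(doc: str) -> str:
    """Strip tag-like markup and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", doc)
    return re.sub(r"\s+", " ", no_tags).strip()


def keep(doc: str) -> bool:
    """Toy quality filter: drop very short or symbol-heavy documents."""
    if len(doc.split()) < MIN_WORDS:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= MAX_SYMBOL_RATIO


def tokenize(doc: str) -> list[str]:
    """Stand-in tokenizer: lowercase whitespace split (real models use subwords)."""
    return doc.lower().split()


def batch(tokens: list[str], seq_len: int) -> list[list[str]]:
    """Pack a token stream into fixed-length sequences, dropping the ragged tail."""
    n_full = len(tokens) // seq_len
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def build_corpus(raw_docs: list[str], seq_len: int = 4) -> list[list[str]]:
    """Run clean -> filter -> tokenize over each document, then batch the stream."""
    stream: list[str] = []
    for raw in raw_docs:
        doc = clean(raw)
        if keep(doc):
            stream.extend(tokenize(doc))
    return batch(stream, seq_len)


def sample_source(mix: dict[str, float], rng: random.Random) -> str:
    """Pick the source of the next document in proportion to the mix weights."""
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]
```

Calling `build_corpus` on a handful of raw strings returns fixed-length token sequences from the documents that survive the filter. `sample_source` shows how a data mix becomes a sampling decision: changing the weights changes what the model sees most often.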
Loss starts high (random guessing) and falls sharply early, then flattens into a long slow grind. Each 0.1 reduction in the flat part represents a significant capability upgrade — the easy patterns are gone, only hard ones remain.
Loss is the single number that summarises how well a model predicts the next token. Lower is better. It lets two models with different architectures be compared objectively, provided they are evaluated on the same dataset with the same tokenizer.
Like a student's test scores over a semester. The first week's improvement is dramatic — basics learned fast. By the end, each extra percentage point requires enormous effort — but still matters.
Illustrative milestones: loss 6.4 → 4.2 (first 1B tokens): basic grammar. 4.2 → 2.9 (next 99B): facts and patterns. 2.9 → 1.4 (remaining 14.9T): nuance, abstraction, complex reasoning.
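To make the loss values more concrete, here is a minimal sketch of how per-token loss is computed and how it maps to perplexity. It assumes the loss is cross-entropy measured in nats (natural log), which is the usual convention but is not stated in this lesson; the function names are illustrative.

```python
import math


def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(probs[target])


def mean_loss(token_losses: list[float]) -> float:
    """The reported training loss is the average over all predicted tokens."""
    return sum(token_losses) / len(token_losses)


def perplexity(loss: float) -> float:
    """exp(loss): the effective number of equally likely choices per token."""
    return math.exp(loss)


if __name__ == "__main__":
    for loss in (6.4, 4.2, 2.9, 1.4):
        print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.1f}")
```

Under that assumption, the losses 6.4, 4.2, 2.9, and 1.4 correspond to perplexities of roughly 600, 67, 18, and 4. In other words, the model goes from hesitating between hundreds of candidate tokens to hesitating between about four.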
Every training step: compute loss, backpropagate gradients, clip if too large, update weights with AdamW using the current learning rate from the schedule. This loop runs hundreds of thousands to millions of times, with each step processing a batch of millions of tokens.
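The loop described above can be sketched in plain Python on a toy problem. This is a simplified illustration under stated assumptions: the quadratic `toy_loss_and_grads` stands in for a real forward and backward pass, the learning rate is held constant rather than read from a schedule, and the hyperparameter defaults (`beta2=0.95`, `weight_decay=0.1`, `max_norm=1.0`) are illustrative choices, not values taken from this lesson.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def adamw_step(params, grads, m, v, step, lr,
               beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    """One AdamW update over flat lists of floats; step counts from 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** step)          # bias correction
        v_hat = v_i / (1 - beta2 ** step)
        p = p - lr * weight_decay * p              # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def toy_loss_and_grads(params, targets):
    """Stand-in for forward + backward: squared distance to a target vector."""
    loss = sum((p - t) ** 2 for p, t in zip(params, targets))
    grads = [2 * (p - t) for p, t in zip(params, targets)]
    return loss, grads


def train(params, targets, steps, lr, max_norm=1.0):
    """Compute loss, get gradients, clip, update; repeat."""
    m = [0.0] * len(params)
    v = [0.0] * len(params)
    history = []
    for step in range(1, steps + 1):
        loss, grads = toy_loss_and_grads(params, targets)
        history.append(loss)
        grads = clip_by_global_norm(grads, max_norm)
        params, m, v = adamw_step(params, grads, m, v, step, lr)
    return params, history
```

Running `train([5.0, -3.0], [1.0, 2.0], steps=200, lr=0.1)` returns the updated parameters and a loss history that falls over time. A real run has the same shape, with billions of parameters in place of two floats.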
A bad schedule can waste millions of dollars of compute. Getting training dynamics right is as important as the architecture choice — often more impactful on final quality.
The learning rate schedule is like a student's study intensity across a semester. Small careful steps at the start, faster in the middle when learning is efficient, slower at the end to solidify what was learned.
Without LR warmup: large early updates can destabilise the randomly initialised weights before useful patterns form. With warmup: the model establishes stable initial representations before full-speed learning begins.
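One common way to get this shape is linear warmup followed by cosine decay. The sketch below assumes that particular schedule, which this lesson does not specify; the function name `lr_at` and all of its parameters are illustrative.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int,
          total_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate for a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 5, 9, 10, 55, 100):
        print(s, round(lr_at(s, peak_lr=1.0, warmup_steps=10, total_steps=100), 4))
```

With these illustrative settings the rate climbs over the first 10 steps, sits at its peak just after warmup, and falls smoothly toward `min_lr` by the final step.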
Training LLaMA 3 405B used up to 16,384 H100 GPUs and roughly 30 million GPU-hours, about two and a half months of non-stop running at full scale. Estimated cost: $50–100 million in compute alone.
Training cost determines who can build frontier models. At $50M+ per run, only a handful of organisations worldwide can attempt it — compute is a genuine competitive moat.
Frontier AI training is an infrastructure problem as much as a research one. The scale of compute is comparable to building and running a small power station.
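30M GPU-hours × ~$2–3 per GPU-hour ≈ $60–90M (GPU-hours already include the GPU count, so it is not multiplied in again). Commonly cited estimates for comparison: GPT-3 (2020): ~$5M. LLaMA 3 405B (2024): ~$50–100M. GPT-4: ~$100M+. The cost of a frontier run has grown more than tenfold in four years.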
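The arithmetic behind these figures is short enough to check directly. In the sketch below the hourly rates are the lesson's rough $2–3 range, and the cluster size passed to `wall_clock_days` is an assumption you can vary; both function names are illustrative.

```python
def cost_range_usd(gpu_hours: float, low_rate: float, high_rate: float) -> tuple[float, float]:
    """Compute cost bounds: GPU-hours times the price of one GPU-hour."""
    return gpu_hours * low_rate, gpu_hours * high_rate


def wall_clock_days(gpu_hours: float, n_gpus: int) -> float:
    """Days of non-stop running if all n_gpus work in parallel the whole time."""
    return gpu_hours / n_gpus / 24


if __name__ == "__main__":
    gpu_hours = 30_000_000
    low, high = cost_range_usd(gpu_hours, 2.0, 3.0)
    print(f"cost: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
    # Assumed cluster size; change it to see how wall-clock time scales.
    print(f"days on 16,384 GPUs: {wall_clock_days(gpu_hours, 16_384):.0f}")
```

At 30 million GPU-hours, the $2–3 range gives $60–90M, and a 16,384-GPU cluster would need about 76 days of uninterrupted running. Halving the cluster doubles the wall-clock time but leaves the GPU-hour bill unchanged.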
Some capabilities were never explicitly trained — they emerged unpredictably as scale increased. In-context learning, multi-step reasoning, and instruction following appeared suddenly at certain thresholds.
Emergent capabilities are why scaling continues. Researchers cannot predict what new abilities will appear at the next tier — which creates a strong ongoing incentive to keep pushing scale.
Like a child who at some point can suddenly understand metaphors, or solve mental arithmetic without counting on fingers — not because they were explicitly taught, but because enough underlying learning accumulated.
GPT-2 (1.5B): cannot follow multi-step instructions. GPT-3 (175B): can. The capability did not improve gradually — it was essentially zero below threshold and functional above it.