Modern AI language models are not programmed with facts or rules by hand. They learn by making predictions, measuring how wrong they were, and adjusting billions of tiny settings (called weights) to be less wrong next time. Repeated enough, corrected mistakes become knowledge.
Course: Beginner.
This lesson covers 5 concepts: A Training Example, 570 Billion Sentences, The Wrong Guess, Learning from the Mistake, After Billions of Steps.
During training, AI sees billions of sentences — but always with the last word hidden. Its job is to predict what comes next. Here it must predict that "Paris" follows "The Eiffel Tower is located in".
This single task, predicting the next word, is the core of how these models are trained. Much of their knowledge of facts, language, and reasoning emerges from getting better at this one challenge.
Like the world's longest fill-in-the-blank test — billions of sentences, each missing its last word. AI's only job is to keep getting better at guessing the missing word.
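The fill-in-the-blank idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model prepares its data: the helper name make_examples is hypothetical, and real systems split text into tokens (word pieces) rather than whole words. The sketch also produces one exercise for every position in the sentence, which is how next-word training typically works in practice, with the final word being just the last exercise.

```python
def make_examples(sentence):
    """Turn one sentence into (context, hidden next word) exercises."""
    words = sentence.split()
    # One exercise per position: everything so far is the context,
    # and the word that follows is the answer to predict.
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


examples = make_examples("The Eiffel Tower is located in Paris")
for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

The last exercise printed is the one from this lesson: the context "The Eiffel Tower is located in" with "Paris" as the hidden word.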
"The Eiffel Tower is located in ___" — early in training, the model gets this wrong. After billions of examples, it gets it right. The path between wrong and right is training.
AI trains on more text than any human could read in many lifetimes. Every sentence becomes a prediction exercise, and every wrong prediction becomes a lesson.
Scale is what makes modern AI work. The same training loop applied to a thousand sentences produces nothing useful. Applied to hundreds of billions of words, it produces something that appears to understand the world.
Imagine doing 570 billion fill-in-the-blank exercises. You would not just memorise answers — you would start to understand how ideas connect, how language works, how the world fits together.
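GPT-3 was trained on roughly 300 billion tokens (word pieces), drawn from a corpus that included about 570 GB of filtered web text. LLaMA 3 was trained on about 15 trillion tokens. Much of the capability gap between a small and a large model comes from this difference in scale, alongside model size and data quality.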
Early in training, the model guesses Rome at 42% — wrong. Paris, the correct answer, only scores 31%. The model has not seen enough examples yet to be confident about where the Eiffel Tower is.
This wrong guess is where learning begins. The gap between "Rome won" and "Paris should have won" is the signal that drives every weight adjustment in the entire network.
Like a student on day one who guesses "Rome" because they know it is a famous European capital — close, but wrong. The mistake is the starting point for learning.
Rome 42%, Paris 31%, Berlin 16%. The model knows the answer is a European capital — it just hasn't seen enough Eiffel Tower sentences to know which one yet.
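To make "how wrong" concrete, the guess can be scored with a single number. The lesson does not name the scoring rule, so the sketch below assumes cross-entropy loss, a common choice: the negative logarithm of the probability the model gave to the correct word. The "other" entry (11%) is also an assumption, added so the probabilities sum to 100%.

```python
import math

# Probabilities from the lesson; "other" is an assumed catch-all
# for every remaining word, so that the total is 100%.
probs = {"Rome": 0.42, "Paris": 0.31, "Berlin": 0.16, "other": 0.11}
correct = "Paris"

# The model's guess is whichever word has the highest probability.
guess = max(probs, key=probs.get)

# Cross-entropy loss: small when the correct word gets a high
# probability, large when it gets a low one.
loss = -math.log(probs[correct])

print(f"guess={guess}, correct={correct}, loss={loss:.2f}")
```

Here the loss is about 1.17. It would shrink towards 0 as the probability of Paris approaches 100%, which is the direction training pushes.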
The error flows back through the entire network (a process called backpropagation), nudging every single weight to make Paris slightly more likely next time. Then the model moves on to the next sentence and does it again.
This loop (predict, measure the error, adjust every weight, repeat) is the entire mechanism of training. It runs over hundreds of billions of words, and each one changes the model slightly.
Like a child adjusting their balance after falling off a bike — the correction is tiny each time, but each fall teaches something. After thousands of falls, balance becomes automatic.
After this step, the weights linking the "Eiffel Tower" context to "Paris" shift by a tiny amount, on the order of 0.00001. Tiny alone. Accumulated over every sentence that mentions the Eiffel Tower and Paris together, the shift makes Paris the obvious answer.
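A single correction can be sketched with a toy model in which four scores, one per candidate word, are the only adjustable settings. This is a deliberate simplification: a real network has billions of weights and uses backpropagation to work out each adjustment. The function names, the "other" category, and the learning rate of 0.1 are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


def train_step(scores, correct, lr=0.1):
    """One correction: raise the correct word's score, lower the rest."""
    probs = softmax(scores)
    # For cross-entropy loss, the gradient for each score is
    # (probability - 1) for the correct word and (probability) otherwise.
    return {
        word: s - lr * (probs[word] - (1.0 if word == correct else 0.0))
        for word, s in scores.items()
    }


# Start from scores that reproduce the lesson's early probabilities.
scores = {"Rome": math.log(0.42), "Paris": math.log(0.31),
          "Berlin": math.log(0.16), "other": math.log(0.11)}

before = softmax(scores)["Paris"]
scores = train_step(scores, correct="Paris")
after = softmax(scores)["Paris"]

print(f"Paris before: {before:.3f}, after one step: {after:.3f}")
```

In this toy, one step moves Paris from 31% to roughly 33%. The nudge is far larger than in a real model because there are only four settings and a generous learning rate, but the direction of the change is the same.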
After billions of training steps, the same question produces a completely different distribution. Paris now wins at 94%. The model learned — by correcting this mistake, and billions of others just like it.
This is what training achieves: not programming, not memorising — gradual improvement through repeated correction until the patterns become deeply embedded in the weights.
The same student, after studying all semester. Same question, completely different answer. Learning happened not through programming, but through billions of tiny corrections.
Before training: Rome 42%, Paris 31%. After training: Paris 94%, Rome 3%. No one programmed in the fact that the Eiffel Tower is in Paris. The model picked it up from its training text through repetition and correction.
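The before-and-after picture can be reproduced with a toy loop that keeps applying the same small correction until Paris reaches 94%. This is a sketch under stated assumptions, not the real procedure: the four scores are the only adjustable settings, the function names, the "other" category, and the learning rate of 0.5 are hypothetical, and the model practises on one sentence only. A real model needs vastly more steps because it is learning billions of weights from many different sentences at once, and the toy's final split among the wrong answers will not match the lesson's figures exactly.

```python
import math

WORDS = ["Rome", "Paris", "Berlin", "other"]


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_until(scores, target, goal, lr=0.5, max_steps=10_000):
    """Repeat predict -> measure error -> adjust until the goal is met."""
    probs = softmax(scores)
    steps = 0
    while probs[target] < goal and steps < max_steps:
        # Nudge the correct word's score up and every other score down.
        scores = [
            s - lr * (p - (1.0 if i == target else 0.0))
            for i, (s, p) in enumerate(zip(scores, probs))
        ]
        probs = softmax(scores)
        steps += 1
    return probs, steps


# Scores that reproduce the lesson's early probabilities.
start = [math.log(p) for p in (0.42, 0.31, 0.16, 0.11)]
final_probs, steps_taken = train_until(
    start, target=WORDS.index("Paris"), goal=0.94
)

for word, p in zip(WORDS, final_probs):
    print(f"{word}: {p:.1%}")
print(f"steps taken: {steps_taken}")
```

No line of this code states where the Eiffel Tower is. Paris wins only because every step penalises the wrong answers a little and rewards the right one.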