What Training Data Does

A model can draw only on what was in its training corpus, and nothing else. Coverage depth determines knowledge depth. The data mix encodes bias. The cutoff date creates an invisible wall. And as AI-generated text fills the internet, a slow degradation problem called model collapse looms.

Course: Moderate.

This lesson covers 5 concepts: Your Data Is Your World, Coverage Shapes Competence, The Knowledge Cutoff, Biases Baked In, When AI Trains on AI.

Your Data Is Your World

A model can draw only on what was in its training corpus, and nothing else. Coverage depth maps directly onto knowledge depth: more documents on a topic generally mean lower training loss on that topic, which in turn means more reliable answers.

Understanding this explains every AI knowledge limitation: why it knows chemistry but not your company's internal process, why it knows Shakespeare but not your unpublished manuscript.

Imagine a person who learned everything they know from reading. They would know well-documented topics in rich detail and sparsely documented topics only superficially. The model is that person.

AI knows Python very well: millions of GitHub repos, tutorials, Stack Overflow answers. It knows an obscure regional dialect poorly: a handful of web pages. Same model, dramatically different depth.
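The link between document count and loss can be made concrete with a toy count-based model. The sketch below is only an illustration, not how a neural language model is trained: it treats each document as a single topic label, estimates a smoothed probability per topic, and reports the surprisal (negative log probability) as a stand-in for loss. The function name, the smoothing value, and the corpus counts are all assumptions made up for this example.

```python
import math
from collections import Counter

def surprisal_per_topic(doc_topics, smoothing=1.0):
    """Return -log2 p(topic) for each topic under a smoothed count model.

    Lower surprisal plays the role of lower loss: the model is less
    "surprised" by topics it has seen many times.
    """
    counts = Counter(doc_topics)
    # Add-k smoothing over the topics that occur at least once.
    total = sum(counts.values()) + smoothing * len(counts)
    return {
        topic: -math.log2((count + smoothing) / total)
        for topic, count in counts.items()
    }

# Made-up toy corpus: many Python documents, a handful on a regional dialect.
corpus = ["python"] * 1000 + ["regional_dialect"] * 3
loss = surprisal_per_topic(corpus)

for topic, value in loss.items():
    print(f"{topic}: {value:.3f} bits")
```

The heavily covered topic ends up with a surprisal near zero, while the sparsely covered one costs several bits: same model, very different depth.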

Coverage Shapes Competence

In the data mix this lesson uses as its example, English web text dominates at 72%, code accounts for 13%, and all other languages combined sit at 2%. The model's competence distribution closely mirrors this mix: excellent at English prose, weakest at underrepresented languages.

The data mix is a design choice with direct downstream consequences. More non-English data produces a more multilingual model. More code produces a better programmer. The choice is explicit.

A student who studied mostly English literature knows English best. The same applies here — the more of something was in the training data, the better the model handles it.

English at 72%: near-native fluency, idioms, slang, nuance. All other languages at 2% combined: functional but often missing cultural context, regional dialects, and local knowledge.
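To see why the data mix is a design choice, it helps to turn the percentages into token counts. The following is a minimal sketch under stated assumptions: the token budget is hypothetical, the "unspecified" bucket simply holds the 13% that the figures above do not break down, and the rebalanced mix is invented for comparison.

```python
def expected_tokens(mix, token_budget):
    """Split a token budget across sources in proportion to their weights."""
    total_weight = sum(mix.values())
    return {
        source: token_budget * weight / total_weight
        for source, weight in mix.items()
    }

# Weights taken from the lesson; "unspecified" is the share not broken down.
lesson_mix = {
    "english_web": 0.72,
    "code": 0.13,
    "other_languages": 0.02,
    "unspecified": 0.13,
}

# Hypothetical alternative that gives other languages five times the share.
multilingual_mix = {
    "english_web": 0.64,
    "code": 0.13,
    "other_languages": 0.10,
    "unspecified": 0.13,
}

BUDGET = 1_000_000  # hypothetical number of training tokens

tokens = expected_tokens(lesson_mix, BUDGET)
rebalanced = expected_tokens(multilingual_mix, BUDGET)

for source in lesson_mix:
    print(f"{source}: {tokens[source]:,.0f} -> {rebalanced[source]:,.0f}")
```

Shifting weight toward other languages means the model sees several times more non-English text for the same budget, at the cost of some English coverage. That trade-off is what "the choice is explicit" refers to.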

The Knowledge Cutoff

Training data ends at a fixed date. Everything after is invisible — events, discoveries, new products, and personnel changes simply do not exist to the model. It cannot know what it was never trained on.

The model does not know when it does not know. It will extrapolate from pre-cutoff patterns when asked about post-cutoff events — which produces confident but potentially wrong answers.

Like an encyclopaedia printed on a specific date. Brilliant and comprehensive up to that point — completely silent on everything that happened the next day, but with no page that says "unknown after here."

Cutoff: April 2024. Ask about a September 2024 election: the model may confidently describe the pre-cutoff frontrunner from polling data, presenting a guess as current fact.
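Because the model itself carries no "unknown after here" page, an application built around it can add one. The sketch below assumes the application already knows the date of the event being asked about, and it reads "April 2024" as the last day of that month. Both function names and the returned messages are hypothetical.

```python
from datetime import date

# Assumption: "April 2024" is interpreted as the end of that month.
CUTOFF = date(2024, 4, 30)

def is_after_cutoff(event_date, cutoff=CUTOFF):
    """Return True if the event falls after the training cutoff."""
    return event_date > cutoff

def answer_policy(event_date, cutoff=CUTOFF):
    """Decide how an application should treat a question about an event."""
    if is_after_cutoff(event_date, cutoff):
        # The model can only extrapolate here, so route to a current source.
        return "after cutoff: verify with a current source"
    return "before cutoff: may be covered by training data"

print(answer_policy(date(2024, 9, 15)))  # the September 2024 election
print(answer_policy(date(2023, 11, 1)))
```

A check like this does not make the model more knowledgeable. It only marks which answers are extrapolations so they are not presented as current fact.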

Biases Baked In

"World knowledge" in a model trained on English web text skews heavily toward US and UK facts (a score of 0.94), then European facts (0.81), falling to South Asian (0.42) and African (0.38) facts. This is not because those facts matter less, but because they are underrepresented in English text.

Bias is not a programming error — it is a faithful reflection of imbalanced training data. Fixing it requires changing the data, not patching the code.

The model is a mirror. If the text it trained on reflects a particular worldview, the model reflects that worldview too — amplified across trillions of tokens.

English-trained AI on African history: knows colonial history well (extensively documented in English). Knows pre-colonial African kingdoms poorly (underrepresented in English web text). Same model, different coverage, different accuracy.
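Skew of this kind only becomes visible when accuracy is reported per group instead of as one overall number. Here is a minimal sketch of that breakdown. The evaluation records are invented purely for illustration and are not measurements of any real model.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute accuracy per group from (group, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, is_correct in records:
        total[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / total[group] for group in total}

# Made-up toy results: each pair is (region of the fact, answered correctly?).
toy_results = [
    ("US/UK", True), ("US/UK", True), ("US/UK", True), ("US/UK", False),
    ("African", True), ("African", False), ("African", False), ("African", False),
]

scores = accuracy_by_group(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)

print(scores)   # per-region accuracy
print(overall)  # the single number that hides the gap
```

The overall figure looks moderate while the per-region figures differ sharply, which is why a breakdown by region is the first step before deciding what data to add.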

When AI Trains on AI

As AI-generated text increasingly fills the internet, future training corpora will contain more of it. Research shows that models trained repeatedly on synthetic data progressively lose output diversity, a problem called model collapse.

Model collapse is slow and invisible — outputs still look reasonable. But the range of ideas, expression diversity, and rare-fact accuracy quietly narrows across generations.

Like photocopying a photocopy ten times — each copy introduces small distortions that accumulate. After enough generations, the original information has degraded toward a blurry average.

Shumailov et al. (2023) found that training on synthetic data causes output diversity to collapse. After 5 generations of training on the same base model's outputs, text diversity had fallen to a fraction of the original.
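The narrowing can be reproduced in miniature with a count-based "model" that is repeatedly retrained on its own samples. This is a toy resampling simulation under stated assumptions, not the experimental setup of Shumailov et al.: the corpus, the token names, and the seed are all made up, and diversity is measured simply as the number of distinct tokens.

```python
import random
from collections import Counter

def next_generation(corpus, rng):
    """Fit token frequencies to the corpus, then sample a same-sized corpus."""
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=len(corpus))

def simulate(corpus, generations, seed=0):
    """Return the number of distinct tokens at each generation."""
    rng = random.Random(seed)
    diversity = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        diversity.append(len(set(corpus)))
    return diversity

# Toy starting corpus: one common token plus 50 rare tokens seen once each.
start = ["common"] * 150 + [f"rare_{i}" for i in range(50)]
history = simulate(start, generations=5)
print(history)
```

Each generation can only sample tokens that survived the previous one, so a rare token that is missed once is gone for good. Every individual corpus still looks plausible, yet the count of distinct tokens never goes up and, in practice, falls steadily.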