AI doesn't read words the way you do. It splits every sentence into small pieces called tokens — then converts each piece into a number. That number sequence is all the model ever sees.
Course: Beginner.
This lesson covers 5 concepts: Your Sentence, AI's Vocabulary, Splitting Into Tokens, Token IDs, What AI Sees.
This is your raw text — words and punctuation exactly as you typed them. AI cannot process this directly.
AI is a math machine. Before it can do anything with your text, every piece of it must be converted into a number.
Imagine passing a note to someone who does not speak the language it is written in. Before they can read it, the note has to be translated into a language they understand.
"Unhappy cats are misunderstood." — perfectly clear to you. To AI, this is about to become a sequence of seven integers.
AI has a fixed dictionary called the vocabulary, with around 100,000 entries in models like GPT-4. Everything you write gets matched to entries in this list.
The vocabulary defines what AI can represent. Words that are not in the dictionary get broken into smaller pieces that are, so nothing is ever lost.
Imagine a dictionary where instead of definitions, every word piece gets a unique number. That is the vocabulary.
GPT-4's vocabulary holds roughly 100,000 tokens. "the" has its own token. "misunderstood" does not, so it gets split into smaller pieces that the vocabulary does contain.
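To make the idea concrete, here is a minimal Python sketch of a vocabulary as a lookup table. It holds only the seven pieces from this lesson's sentence, paired with the ID numbers this lesson uses for them. The names VOCAB and in_vocab are made up for illustration, and a real vocabulary is learned from training data and is far larger.

```python
# A toy vocabulary: each known piece maps to one unique ID number.
# Only seven entries here; a real vocabulary has around 100,000.
VOCAB = {
    "Un": 1844,
    "happy": 6380,
    "cats": 16097,
    "are": 527,
    "mis": 5435,
    "understood": 18651,
    ".": 13,
}


def in_vocab(piece: str) -> bool:
    """Return True if this exact piece has its own entry."""
    return piece in VOCAB


print(in_vocab("cats"))            # True: a short common word has an entry
print(in_vocab("misunderstood"))   # False: the whole word is not listed...
print(in_vocab("mis"), in_vocab("understood"))  # True True: ...but its parts are
```

Every ID appears only once in the table, which is what lets a number stand in for a piece without any confusion.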
The sentence gets split into pieces called tokens. Each token is a word, part of a word, a space, or punctuation.
This splitting step is how AI handles any text, including words it has never seen: an unfamiliar word is broken into smaller known pieces, so there is always a match.
Like cutting a sentence with scissors — short everyday words stay whole, long or unusual words get snipped into smaller parts that the dictionary does know.
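The scissors idea can be sketched in a few lines of Python. This is a simplified illustration, not how production tokenizers work: it tries the longest known piece first at each position, while real tokenizers such as GPT-4's use a learned method called byte pair encoding. The names PIECES and split_into_tokens are hypothetical, and the sketch simply skips spaces, whereas real tokenizers keep them.

```python
# The known pieces (a toy stand-in for the full vocabulary).
PIECES = {"Un", "happy", "cats", "are", "mis", "understood", "."}


def split_into_tokens(text: str) -> list[str]:
    """Cut text into known pieces, always taking the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == " ":
            i += 1  # assumption for this sketch: spaces are skipped
            continue
        # Try the longest possible slice first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in PIECES:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No slice matched: this toy dictionary is too small.
            raise ValueError(f"no known piece at position {i}: {text[i:]!r}")
    return tokens


print(split_into_tokens("Unhappy cats are misunderstood."))
# ['Un', 'happy', 'cats', 'are', 'mis', 'understood', '.']
```

A real vocabulary also contains every single character as a fallback piece, so the "no match" error in this sketch never happens in practice.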
"Unhappy" → "Un" + "happy". "Misunderstood" → "mis" + "understood". Short common words like "cats" and "are" stay in one piece.
Each token gets swapped for its unique ID number, a single integer from AI's vocabulary.
This is the key transformation. AI is a math machine — it cannot process text, but it can process integers at extraordinary speed.
Like turning a message into a secret code where each word piece is swapped for its number. The code is consistent: the same piece always gets the same number.
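Here is a small Python sketch of that swap, assuming the sentence has already been split into its seven pieces. The table TOKEN_TO_ID and the function to_ids are illustrative names, and the ID numbers are the ones used in this lesson's example.

```python
# Lookup table from piece to ID (a seven-entry slice of a vocabulary).
TOKEN_TO_ID = {
    "Un": 1844,
    "happy": 6380,
    "cats": 16097,
    "are": 527,
    "mis": 5435,
    "understood": 18651,
    ".": 13,
}


def to_ids(tokens: list[str]) -> list[int]:
    """Swap each piece for its ID number."""
    return [TOKEN_TO_ID[token] for token in tokens]


tokens = ["Un", "happy", "cats", "are", "mis", "understood", "."]
print(to_ids(tokens))           # [1844, 6380, 16097, 527, 5435, 18651, 13]

# The code is consistent: the same piece always gets the same number.
print(to_ids(["cats", "cats"]))  # [16097, 16097]
```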
"Un" → 1844. "happy" → 6380. "cats" → 16097. Seven tokens, seven numbers. The sentence is now a list of integers.
This is the sentence as AI actually sees it — not words, but a clean sequence of integers ready for processing.
Everything AI does — understanding, generating, reasoning — operates on sequences like this. Never on the original text.
Your beautiful sentence got translated into a shopping list of numbers. That is the trade-off: humans need words, computers need numbers.
[1844, 6380, 16097, 527, 5435, 18651, 13] — this is what "Unhappy cats are misunderstood." looks like to AI.
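As a last illustration, this Python sketch starts from the list of integers and looks each one up again to recover the pieces. It is a toy example under the same assumptions as this lesson: ID_TO_TOKEN is a made-up seven-entry table, not a real model's vocabulary.

```python
# Reverse lookup table: ID number back to its piece.
ID_TO_TOKEN = {
    1844: "Un",
    6380: "happy",
    16097: "cats",
    527: "are",
    5435: "mis",
    18651: "understood",
    13: ".",
}

# This list of integers is all the model receives.
what_ai_sees = [1844, 6380, 16097, 527, 5435, 18651, 13]

# Looking each ID up recovers the pieces in their original order.
pieces = [ID_TO_TOKEN[token_id] for token_id in what_ai_sees]
print(pieces)
# ['Un', 'happy', 'cats', 'are', 'mis', 'understood', '.']
```

Because every ID maps back to exactly one piece, the numbers carry the same information as the pieces. This toy table does not record spaces; real tokenizers store them inside the tokens so the exact original text can be rebuilt.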