Tokenization (Deep)

You know tokens split text into pieces. Now see how the vocabulary is actually built — Byte-Pair Encoding merges the most common character pairs 50,000 times until common words earn their own token. And why token count is the unit that drives every cost, limit, and performance number in AI.

Course: Moderate.

This lesson covers 5 concepts: Every Character = One Token, BPE Builds the Vocabulary, Frequent Pairs Win, After 50,000 Merges, Tokens = Cost + Context.

Every Character = One Token

Before any vocabulary is built, every individual character is its own token. "Unhappiness" becomes 11 tokens — one per letter. This is the starting point BPE improves on.

Character-level tokenization makes sequences too long. Attention cost grows quadratically with sequence length, so 11 tokens where 2 would do costs about (11/2)^2 ≈ 30x more attention computation.

Imagine texting someone by sending one letter at a time instead of whole words. Character-level tokenization is exactly that — it works, but it wastes enormous space.

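"unhappiness" = 11 character tokens. A single 500-word essay would become ~2,500 tokens at character level. The same text is roughly 650 tokens with BPE, using the common rule of thumb of about 0.75 words per token. That is roughly a 4x difference in sequence length and cost.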
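
The arithmetic above can be checked with a small sketch. It assumes attention cost scales with the square of the sequence length and ignores every other cost in a model; the function names are illustrative, not part of any library.

```python
def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: every character is its own token."""
    return list(text)


def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Relative attention cost of two sequence lengths, assuming cost ~ n**2."""
    return (long_len / short_len) ** 2


tokens = char_tokens("unhappiness")
print(len(tokens))                   # 11
print(attention_cost_ratio(11, 2))   # 30.25
```

The ratio only describes the attention step under that quadratic assumption; real models have other costs that grow linearly with length.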

BPE Builds the Vocabulary

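BPE builds the vocabulary by finding the most common pair of adjacent symbols in a massive corpus, merging that pair into one token, and repeating until the vocabulary reaches its target size, here 50,000 merges. On the first pass the symbols are single characters; on later passes they can be tokens produced by earlier merges.

BPE tends to give frequent words and word pieces an efficient single token, while rare or novel words stay representable as sequences of smaller known pieces.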

Like snapping Lego bricks together from smallest to largest — start with individual letters, then join the most common pairs, then join those pairs with their neighbours, until you have useful words.

Early merge: "e"+"r" appears in her, later, water, father, better — millions of times in any English corpus. It earns a single token. That token then appears in thousands of other words.
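
The merge loop can be sketched in a few lines. This is a minimal toy version, assuming a tiny hand-made word-frequency table in place of a real corpus and a simple tie-break rule; the names train_bpe, count_pairs, and merge_pair are illustrative, and production tokenizers add details such as byte-level alphabets and word-boundary markers.

```python
from collections import Counter


def count_pairs(words: dict[tuple[str, ...], int]) -> Counter:
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged: dict[tuple[str, ...], int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Return the learned merges, in the order they were learned."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


# Hypothetical word counts, chosen only to mirror the "er" example above.
toy_corpus = {"her": 5, "later": 4, "water": 3, "father": 2, "better": 2}
merges = train_bpe(toy_corpus, 3)
print(merges[0])   # ('e', 'r')
```

On this toy table "e"+"r" occurs in every word, so it is merged first, and the new "er" symbol then takes part in later merges.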

Frequent Pairs Win

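BPE scans the entire training corpus and ranks every adjacent pair of symbols by how often it appears. The top scorer gets merged into a single token first. In English text that is typically a very common pair such as "th" or "er"; a pair like "pp" is frequent enough to be merged too, but only later.

The pairs that win most often frequently overlap with what linguists would call productive morphemes: prefixes, suffixes, and syllables that appear in thousands of different words. That overlap is a by-product of frequency, not something the algorithm is told about.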

It is a popularity contest for letter pairs. The most common ones win tokens first, then the next most common, working down the frequency list until the vocabulary is full.

"pp" earns a token because it appears in happy, apple, appear, apply, approve, supply, and many more words. "in" earns one from in, into, within, interesting. Using nothing but raw frequency, the algorithm ends up recovering many of the pieces that morphology would predict.
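
The popularity contest can be illustrated by counting adjacent letter pairs directly. The sketch below assumes a made-up seven-word sample, so the counts say nothing about real English; rank_pairs is an illustrative helper, not a library function.

```python
from collections import Counter


def rank_pairs(text: str, top: int = 5) -> list[tuple[tuple[str, str], int]]:
    """Rank adjacent letter pairs inside each word by how often they occur."""
    pairs = Counter()
    for word in text.lower().split():
        pairs.update(zip(word, word[1:]))
    return pairs.most_common(top)


# Hypothetical sample text, chosen only for illustration.
sample = "the other then this that happy apple"
print(rank_pairs(sample, 1))   # [(('t', 'h'), 5)]
```

In this sample "th" wins the first merge with 5 occurrences, while "pp" appears twice and would have to wait for a later round.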

After 50,000 Merges

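After 50,000 merge operations, "unhappiness" has gone from 11 character tokens to 2 in this example. "un" and "happiness" each became common enough to earn their own token, so the word reuses them. The exact split depends on the tokenizer and the corpus it was trained on.

This reuse is the elegance of BPE: rare or novel words are representable as combinations of common pieces. As long as the base alphabet covers every character (byte-level BPE starts from all 256 byte values), nothing is unrepresentable.

The word was never put in the dictionary directly, but its parts were. "un" came from unknown, unfit, unfair. "happiness" was assembled from pieces such as "happ" and "ness", which are shared with happily, happier, kindness, and sadness, and the full word is itself common enough to be merged. "Unhappiness" just borrows both.

11 characters → 2 tokens: 5.5x compression for this word. At that ratio, text that needs 1,000 tokens at character level needs only ~180 tokens at BPE level. Averaged over ordinary English text the ratio is closer to 4 characters per token, and that gap determines how much of a document fits in one conversation.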
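
Encoding a word with an already-trained vocabulary can be sketched as replaying the merges in the order they were learned. The merge list below is hand-written and hypothetical, chosen so that the example word ends up as "un" + "happiness"; a real tokenizer's merge table would differ, and apply_merges is an illustrative name.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then replay each learned merge in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge table (an assumption for illustration, not taken from any real model).
example_merges = [
    ("u", "n"), ("p", "p"), ("s", "s"), ("h", "a"), ("ha", "pp"),
    ("n", "e"), ("ne", "ss"), ("happ", "i"), ("happi", "ness"),
]

print(apply_merges("unhappiness", example_merges))   # ['un', 'happiness']
print(apply_merges("unfair", example_merges))        # ['un', 'f', 'a', 'i', 'r']
```

The second call shows the fallback: "unfair" reuses the "un" token, and the rest of the word stays as single characters because no learned merge covers it.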

Tokens = Cost + Context

Every practical limit in AI is measured in tokens — not words, not characters. Context windows, API bills, response lengths, and inference speed all come down to token count.

Understanding token economics changes how you write prompts and evaluate costs. It also explains quirks such as why AI struggles with arithmetic on large numbers: a long number is often split into uneven multi-digit chunks rather than individual digits.

Tokens are like fuel. Every word you put in burns tokens. Every word AI generates burns tokens. The tank has a fixed size — and the pump charges per litre.

GPT-4 Turbo: $10 per million input tokens. A 1-page prompt ≈ 500 tokens = $0.005. A 100-page research paper ≈ 50,000 tokens = $0.50. The meter is always running in tokens.
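
The same arithmetic as a small sketch. It assumes the $10 per million input tokens figure quoted above and covers input tokens only; prices vary by model and over time, and input_cost_usd is an illustrative helper, not part of any API.

```python
def input_cost_usd(num_tokens: int, usd_per_million: float = 10.0) -> float:
    """Cost of sending `num_tokens` input tokens at a flat per-million rate."""
    return num_tokens * usd_per_million / 1_000_000


one_page = input_cost_usd(500)          # a 1-page prompt, about 500 tokens
research_paper = input_cost_usd(50_000) # a 100-page paper, about 50,000 tokens

print(f"${one_page:.3f}")        # $0.005
print(f"${research_paper:.2f}")  # $0.50
```

Output tokens are billed separately, usually at a higher rate, so a full estimate would add a second term for the generated response.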