Static embeddings give every word one fixed vector — which breaks for words like "bank" that mean different things in different sentences. Transformers fix this with contextual embeddings: attention reshapes each word's vector based on its neighbours, resolving ambiguity automatically.
Course: Moderate.
This lesson covers 5 concepts: "One Word, One Vector"; "The Ambiguity Problem"; "Context Transforms the Vector"; "bank + river context"; "What Embeddings Power".
Every token maps to a vector: thousands of numbers that together encode its meaning. The 24 bars shown here are a sample of the 4,096 dimensions in the vector for "bank". Green bars are positive values, red bars negative.
This vector is the only form of meaning the model works with. But for "bank", one fixed vector must somehow encode two very different meanings — and that causes problems.
Think of the vector as a fingerprint made of thousands of measurements instead of ridge patterns. No two words share a fingerprint, but "bank" has a problem: two very different meanings are squeezed into one.
Picture "bank" as 4,096 floats: loosely speaking, some dimensions lean towards financial concepts and others towards physical structures near water, all mixed together in one static vector.
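A static embedding can be pictured as a plain lookup table. The sketch below is a toy illustration, not how any particular model stores its vectors: the `EMBEDDINGS` table, the `embed` helper, the 4-dimensional vectors and all of their values are made up for this example.

```python
# Toy static embedding table: one fixed vector per token.
# The 4-dimensional vectors and their values are invented for illustration;
# the lesson's "bank" vector has 4,096 dimensions.
EMBEDDINGS = {
    "the":     [0.1, 0.0, 0.1, 0.0],
    "river":   [0.0, 0.9, 0.1, 0.8],
    "savings": [0.9, 0.0, 0.8, 0.1],
    "bank":    [0.5, 0.5, 0.4, 0.4],
    "flooded": [0.0, 0.7, 0.0, 0.9],
    "closed":  [0.6, 0.1, 0.5, 0.0],
}


def embed(sentence):
    """Look up the fixed vector for each lowercased, whitespace-separated token."""
    return [EMBEDDINGS[token] for token in sentence.lower().split()]


river_sentence = embed("The river bank flooded")
money_sentence = embed("The savings bank closed")

# "bank" is the third token in both sentences and gets the same vector each time.
print(river_sentence[2] == money_sentence[2])  # True
```

The lookup never sees the rest of the sentence, which is why "bank" comes out identical in both.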
The static "bank" vector is almost equally close to financial terms and river terms. Without context, it cannot distinguish — because a single fixed vector genuinely cannot represent two meanings at once.
This ambiguity propagates through every downstream calculation. When the model reasons about "bank", it works with a blended meaning that is half-financial, half-geographical.
Like knowing two people named Alex — a doctor and a teacher. If someone says "call Alex", you need more information. Static embeddings have this problem with every word that has multiple meanings.
Similarity of the static "bank" vector to nearby terms: savings=0.81, finance=0.78 and river=0.76, shore=0.73. Two unrelated groups of concepts sit at nearly the same distance. The static embedding is stuck in the middle.
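To see why a blended vector ends up equidistant, here is a minimal sketch using cosine similarity, one common way to compare embeddings (the lesson does not say which measure its scores use). The 2-dimensional space, the axis meanings and every value are assumptions chosen to make the effect visible; they do not reproduce the scores above.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-D meaning-space: axis 0 = "finance", axis 1 = "geography".
finance = [1.0, 0.0]
river = [0.0, 1.0]

# A static "bank" vector that blends both senses in equal parts.
static_bank = [0.5, 0.5]

sim_finance = cosine(static_bank, finance)
sim_river = cosine(static_bank, river)

# Both similarities come out the same, so the vector cannot favour either sense.
print(round(sim_finance, 3), round(sim_river, 3))  # 0.707 0.707
```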
Transformers fix the ambiguity problem by running attention before finalising any word's meaning. Each token's vector gets updated by what surrounds it — dynamically reshaping its position in meaning-space.
This is a fundamental advance of transformer models over static methods such as Word2Vec: the same model produces different representations for the same word in different sentences.
Static embeddings are like a dictionary with one entry per word. Contextual embeddings read the whole sentence first, then decide what the word means right now in this specific context.
The "bank" in "The river bank flooded" and the "bank" in "The savings bank closed" start with identical vectors, and end up in different regions of meaning-space after attention.
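The update can be sketched with a heavily simplified attention step. This is an illustration under stated assumptions, not a transformer layer: real models use learned query, key and value projections, scaling, several heads and many layers, while this sketch uses the raw vectors for all three roles. The `STATIC` table, the `attend` helper and the 2-D values (axis 0 = finance, axis 1 = geography) are invented for the example.

```python
import math

# Hypothetical static vectors in a 2-D space: [finance, geography].
STATIC = {
    "the":     [0.1, 0.1],
    "river":   [0.0, 2.0],
    "flooded": [0.0, 1.5],
    "savings": [2.0, 0.0],
    "closed":  [1.5, 0.0],
    "bank":    [1.0, 1.0],
}


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, position):
    """Return the context-mixed vector for the token at `position`.

    Simplification: the raw vectors act as queries, keys and values.
    """
    query = vectors[position]
    scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
    weights = softmax(scores)
    return [
        sum(w * v[d] for w, v in zip(weights, vectors))
        for d in range(len(query))
    ]


river_tokens = [STATIC[t] for t in "the river bank flooded".split()]
money_tokens = [STATIC[t] for t in "the savings bank closed".split()]

river_bank = attend(river_tokens, 2)  # "bank" is at index 2
money_bank = attend(money_tokens, 2)

# Same starting vector, different results: the geography component dominates
# in the river sentence, the finance component in the savings sentence.
print(river_bank)
print(money_bank)
```

Both calls start from the same `STATIC["bank"]` entry; only the neighbouring tokens differ, and that alone moves the result.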
After attention processes "The river bank flooded", "bank"'s vector has absorbed context from "river" and "flooded". It now sits firmly with geographical terms — financial meanings suppressed to near zero.
This disambiguation is automatic and continuous — the model never explicitly identifies "bank" as ambiguous. Attention naturally pulls meaning from neighbours.
Like a chameleon — "bank" changed to match its surroundings. In a sentence about rivers, it became a river word. In a sentence about money, it would become a finance word.
After attention, the similarities for "bank" become river=0.93, shore=0.88, flood=0.82, water=0.79, and savings drops to 0.07. The financial meaning all but disappears. Context resolved the ambiguity without a single hand-written rule.
Contextual embeddings are the foundation of many modern AI applications beyond chat. Semantic search, retrieval-augmented generation (RAG), recommendations, and classifiers all work by computing distances and similarities in embedding space.
Embeddings turn qualitative meaning into quantitative geometry — making it possible to build search, retrieval, and recommendation systems without any hand-coded rules.
Once meaning is a point in space, every problem of finding similar things becomes a problem of finding nearby points. And computers are extraordinarily fast at finding nearby points.
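As a minimal sketch of "finding nearby points", the snippet below ranks a few documents by cosine similarity to a query vector. The document titles, the 3-dimensional vectors and the `search` helper are hypothetical; in practice the vectors would come from an embedding model, and large collections typically use an approximate nearest-neighbour index instead of this brute-force scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up document embeddings; real ones would come from an embedding model.
DOCUMENTS = {
    "Opening a savings account":           [0.9, 0.1, 0.0],
    "Flood warnings for the river valley": [0.1, 0.9, 0.1],
    "Playlist for a long run":             [0.0, 0.1, 0.9],
}


def search(query_vector, documents, top_k=2):
    """Return the titles of the top_k documents closest to the query."""
    ranked = sorted(
        documents,
        key=lambda title: cosine(query_vector, documents[title]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical embedding of the query "how do I save money".
query = [0.8, 0.2, 0.1]
results = search(query, DOCUMENTS)

# The savings document ranks first because its vector points the same way.
print(results)
```

No keyword in the query has to match the document title; the ranking depends only on where the vectors sit.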
Google's semantic search, Spotify recommendations, Notion AI search, and most RAG pipelines rest on the same idea: find what is geometrically closest in embedding space.