AI doesn't treat all words equally. Every word scores every other word for relevance, then blends their meanings using those scores — resolving ambiguity and tracking context across the whole sentence.
Course: Beginner.
This lesson covers 5 concepts: The Sentence; All Words, All at Once; The Attention Map; Attention as Votes; Context Resolved.
AI reads the whole sentence at once. The challenge: which words actually connect to which? Without knowing that, pronouns like "it" are meaningless.
Without knowing which words relate to which, AI cannot resolve ambiguity, track references, or understand context — the sentence stays surface-level.
Think of a mystery novel where you need to track who every "he", "she", and "it" refers to. If you lose track, the whole story falls apart.
"The animal didn't cross because it was tired." — What does "it" mean? The animal or the street? Every word needs context from other words to be understood.
Attention lets every word ask: which other words are most relevant to my meaning right now? Each word scores all others, then blends information based on those scores.
This simultaneous computation is a big part of why transformers run fast on parallel hardware: the scores for every pair of words can be computed at once, rather than one word after another.
Like a group of students in a study room — each person reads everyone's notes and decides whose work is most useful for their specific question.
"it" scans every word, scores each for relevance, gives 82% weight to "animal", and blends mostly "animal"'s meaning into its own representation.
Each row is one word. Each column is another word. The brightness of a cell shows how much attention that row-word pays to that column-word. Look at the "it" row — almost entirely focused on "animal".
This matrix is computed fresh for every sentence. "It" near animals gets a very different pattern than "it" near machines or places.
Like a table of who is really listening to whom in a meeting. Each row is one person, each column is someone they might be paying attention to.
"it" row: animal=0.82, cross=0.08, it=0.06, tired=0.04. The brightest cell tells you exactly what "it" refers to — no rules needed, just learned geometry.
The raw scores for "it" get converted into weights that sum to exactly 1.0 — like a vote where "animal" wins with 82% of the ballot.
Forcing weights to sum to 1 turns attention into a fixed budget: giving more weight to one word means giving less to the others, so attention is always a focused allocation, not an unbounded score.
Like a voting system: each word spreads its attention votes across every word in the sentence, itself included. The word with the most votes shapes the meaning the most.
"it" spends 82% of its attention on "animal", 8% on "cross", 6% on itself, 4% on "tired". The winner gets most of the meaning.
After attention, "it" carries a meaning vector shaped 82% by "animal". Its nearest semantic neighbours are now animal-related words, so the ambiguity is resolved in favour of the animal.
This is the core output of attention — every word leaves with a context-aware meaning. That richer vector is what all subsequent layers work with.
Like a nearly blank sticky note that picked up writing from its nearest neighbour. "It" started as a generic pronoun and absorbed the meaning of "animal" through attention.
Before attention: "it" = generic pronoun. After: nearest words are animal (0.91), creature (0.85), being (0.79). "Street" does not appear anywhere near the top.
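One way to see this shift is to compare how close "it" sits to "animal" and to "street" before and after blending. The sketch below uses the lesson's attention weights but invented two-number word vectors, so the similarity values it prints are illustrative and will not match the 0.91 / 0.85 / 0.79 figures above. Cosine similarity is assumed as the closeness measure, and `cosine` and `blend` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def blend(weights, vectors):
    """Weighted average of the vectors, using the attention weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Hypothetical 2-number meaning vectors, invented for illustration only.
animal = [1.0, 0.1]
cross = [0.4, 0.6]
it_before = [0.5, 0.5]   # generic pronoun: equally close to animal and street
tired = [0.8, 0.3]
street = [0.1, 1.0]      # used only for comparison

# Attention weights for "it" from the lesson: animal, cross, it, tired.
weights = [0.82, 0.08, 0.06, 0.04]
it_after = blend(weights, [animal, cross, it_before, tired])

before_animal = cosine(it_before, animal)
before_street = cosine(it_before, street)
after_animal = cosine(it_after, animal)
after_street = cosine(it_after, street)

print(f"before: animal={before_animal:.2f}  street={before_street:.2f}")
print(f"after:  animal={after_animal:.2f}  street={after_street:.2f}")
```

In this toy setup "it" starts out equally similar to both candidates and ends up far closer to "animal", which is the same effect the lesson describes.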