Every token produces three vectors: Query, Key, and Value. Attention takes the dot product of every Query with every Key, divides each score by √d_k, applies softmax across each row, then blends the Value vectors using those weights. Run this 32 times in parallel with different weight matrices and you get multi-head attention, with each head free to learn a different type of relationship.
Course: Moderate.
This lesson covers 5 concepts: Q, K, V — Three Projections, Head 1 — Semantic Roles, "cat" Attends To..., 32 Heads in Parallel, Head 2 — Coreference.
Every token produces three vectors from its embedding: a Query (what it needs), a Key (what it offers), and a Value (what it shares). These three projections are the entire foundation of attention.
Separating Q, K, and V lets the model learn three distinct functions independently — asking, advertising, and contributing — rather than collapsing them into one ambiguous vector.
Like three business cards for the same person: one saying what they need, one saying what they know, one saying what they can give. Attention matches needs to knowledge, then distributes contributions.
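The three projections can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's own code: the embedding for "cat", the toy sizes d_model = 4 and d_k = 2, and the random weight matrices W_q, W_k, W_v are all assumptions made up for the example. In a trained model the matrices are learned.

```python
import random

def matvec(vec, mat):
    """Multiply a row vector (length d_in) by a matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, learned in practice)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k = 4, 2  # toy sizes, assumed for illustration

# Hypothetical embedding for the token "cat".
x_cat = [0.2, -0.5, 0.9, 0.1]

# Three separate projection matrices: one each for Query, Key, and Value.
W_q = random_matrix(d_model, d_k, rng)
W_k = random_matrix(d_model, d_k, rng)
W_v = random_matrix(d_model, d_k, rng)

q_cat = matvec(x_cat, W_q)  # what "cat" is looking for
k_cat = matvec(x_cat, W_k)  # what "cat" advertises to other tokens
v_cat = matvec(x_cat, W_v)  # what "cat" contributes when attended to

print("Q:", q_cat)
print("K:", k_cat)
print("V:", v_cat)
```

The same embedding goes into all three projections. Only the weight matrices differ, which is what lets the three roles be learned independently.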
"cat" Q: encodes "I need to know what action involved me." "cat" K: encodes "I am a recipient of actions." "cat" V: carries the semantic content others borrow when attending to cat.
Head 1 for "She fed her cat": "fed" focuses on "cat" (0.37) — verb attending to its object. "cat" focuses on "fed" (0.41) — patient attending to its action. Semantic roles captured automatically.
This matrix is recomputed for every input sentence. The same words in a different order or context can produce a very different matrix, because every score is computed from the tokens as they appear in that input.
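As a rough sketch of how such a matrix is computed, the snippet below scores every Query against every Key, divides by √d_k, and applies softmax to each row. The Q and K values are invented for illustration, so the printed weights will not match the 0.37 and 0.41 quoted in this lesson. Only the procedure is the point.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Return one row of attention weights per query token."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    return weights

tokens = ["She", "fed", "her", "cat"]

# Hypothetical 2-dimensional Query and Key vectors, one row per token.
Q = [[0.1, 0.3], [0.9, 0.2], [0.4, 0.8], [0.7, 0.5]]
K = [[0.2, 0.1], [0.8, 0.6], [0.3, 0.4], [0.9, 0.3]]

A = attention_weights(Q, K)
for tok, row in zip(tokens, A):
    print(f"{tok:>4}", [round(w, 2) for w in row])
```

Each row sums to 1, so it can be read as the share of attention one token gives to every token in the sentence, itself included.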
Like a diagram showing which words are doing important work together. "fed" and "cat" are tightly linked because one is doing something to the other.
"fed"→"cat" (0.37): verb identifies its direct object. "her"→"cat" (0.35): possessive modifier attends to the possessed noun. Grammatical structure emerges from learned geometry.
"cat" row isolated: 41% on "fed", 22% on "her", 19% on "She", 18% on itself. The new contextual embedding for "cat" will be a weighted blend with these exact proportions.
After this blend, 41% of the weight in "cat"'s output vector comes from "fed"'s Value vector. The representation of "cat" now carries the information that it is the thing that was fed.
Like a recipe: "cat"'s new meaning is 41% "fed", 22% "her", 19% "She", 18% itself. The dominant ingredient is "fed" — because what happened to the cat matters most here.
o_cat = 0.41·V_fed + 0.22·V_her + 0.19·V_She + 0.18·V_cat. The output is shaped most by "fed" — "cat" has absorbed the context of being the direct object of a feeding action.
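The blend above can be worked through numerically. The weights are the ones given in this lesson. The 3-dimensional Value vectors are hypothetical, chosen so that each token's contribution is easy to trace by eye.

```python
# Attention weights for the "cat" row, taken from the lesson.
weights = {"fed": 0.41, "her": 0.22, "She": 0.19, "cat": 0.18}

# Hypothetical Value vectors (not from a real model).
values = {
    "She": [1.0, 0.0, 0.0],
    "fed": [0.0, 1.0, 0.0],
    "her": [0.0, 0.0, 1.0],
    "cat": [1.0, 1.0, 1.0],
}

dim = 3
# o_cat = sum over tokens of (attention weight * Value vector), per dimension.
o_cat = [sum(weights[tok] * values[tok][i] for tok in weights) for i in range(dim)]

print([round(x, 2) for x in o_cat])  # [0.37, 0.59, 0.4]
```

The middle component is the largest because it collects the 0.41 from "fed" plus the 0.18 that "cat" contributes to every dimension.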
Multi-head attention runs H independent attention operations simultaneously, each with different learned projections. Each head develops a different specialisation — all in one forward pass.
A single head produces only one attention distribution per token, so it struggles to capture verb-object syntax and pronoun-antecedent coreference at the same time. Multi-head parallelism addresses this: each head focuses on what its weights have learned to find.
Like having 32 experts read the same sentence simultaneously. Each expert highlights different connections. Together they catch far more than any single reader could.
Head 1: fed→cat (verb-object). Head 2: her→She (coreference). Head 3 might capture She→fed (agent-action). 32 different patterns, all computed at once, then combined into one output.
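A minimal sketch of how several heads might be run and combined, under stated assumptions: 2 heads instead of 32 to keep it readable, random matrices standing in for learned weights, and the standard Transformer convention of concatenating head outputs and mixing them with an output projection W_o. The lesson only says the heads are "combined", so the W_o step is an assumption here.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    """One head: project, score, scale by 1/sqrt(d_k), softmax, blend Values."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """Run every head on the same input, concatenate per token, mix with W_o."""
    outputs = [attention_head(X, *h) for h in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)

def rand(rows, cols):
    """Random stand-in for a learned matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, n_heads = 4, 8, 2  # toy sizes, assumed
d_head = d_model // n_heads

X = rand(n_tokens, d_model)  # hypothetical embeddings for "She fed her cat"
heads = [(rand(d_model, d_head), rand(d_model, d_head), rand(d_model, d_head))
         for _ in range(n_heads)]
W_o = rand(n_heads * d_head, d_model)

out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))  # 4 8
```

Every head sees the same input X but has its own W_q, W_k, and W_v, which is why the heads can end up attending to different relationships. The loop over heads is sequential here for clarity. Real implementations batch the heads into one tensor operation so they run in parallel.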
Head 2, same sentence: "her" now attends 78% to "She". This head has learned to resolve pronoun-antecedent references — completely different from Head 1's verb-object focus.
This is the power of multi-head: 32 independent analyses captured simultaneously. Head 1 handles semantic roles. Head 2 handles coreference. No single head needs to do everything.
Same sentence, same words, same mechanism — completely different pattern, because each head has learned different weights that highlight different types of connections.
Head 1: fed→cat (0.37). Head 2: her→She (0.78). The combined output carries both facts: the cat was fed, and "her" refers to "She". Both relationships are encoded in one forward pass.