A transformer block has two sublayers — attention and FFN — each followed by a residual connection. Attention handles communication between tokens. FFN handles computation within each token. Stack 32 of these and you get hierarchical abstraction: syntax at the bottom, reasoning at the top.
Course: Moderate.
This lesson covers 5 concepts: The Residual Stream, Sublayer 1 — Attention, Sublayer 2 — Feed-Forward, One Complete Block, 32 Layers = Abstraction.
Every transformer block takes the current token representations, applies a transformation, then adds the result back to the original. This x + f(x) pattern is called a residual connection — the highway of the transformer.
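The x + f(x) pattern can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: `residual` and `small_update` are hypothetical names, and the sublayer here is a stand-in that simply scales its input.

```python
def residual(f, x):
    """Return x + f(x) elementwise, where x is a vector given as a list of floats."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]

# Hypothetical sublayer: proposes a small update proportional to the input.
def small_update(x):
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 0.5]
y = residual(small_update, x)
print(y)  # the original values, each nudged by the sublayer's update
```

If the sublayer outputs all zeros, the input passes through unchanged, which is why the residual path is often described as a highway.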
Residual connections mitigate the vanishing gradient problem: because the derivative of x + f(x) contains an identity term, gradients can flow directly from the output back to the input without being scaled down by every transformation along the way.
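A scalar toy makes the gradient argument concrete. The sketch below assumes every layer has the same local slope f'(x) = 0.01 (a made-up value standing in for a weak layer); real networks have Jacobian matrices rather than scalars, so treat this as intuition only. `gradient_through_stack` is a hypothetical helper.

```python
def gradient_through_stack(local_slope, depth, use_residual):
    """Derivative of the stack's output with respect to its input, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        # Plain layer    y = f(x):     dy/dx = f'(x)
        # Residual layer y = x + f(x): dy/dx = 1 + f'(x)
        grad *= (1.0 + local_slope) if use_residual else local_slope
    return grad

plain = gradient_through_stack(0.01, depth=32, use_residual=False)
with_residual = gradient_through_stack(0.01, depth=32, use_residual=True)
print(f"plain: {plain:.3e}  residual: {with_residual:.3f}")
```

Under these assumptions the plain stack's gradient shrinks toward zero, while the residual stack's gradient stays close to 1 because each factor includes the identity term.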
Like editing a document by writing in the margins instead of erasing and rewriting. Each block adds its insights without wiping out what came before.
Without residuals, a 32-layer network often trains worse than an 8-layer one. With them, the deeper network typically outperforms the shallower one, so depth becomes a dependable lever for capability.
The first sublayer is multi-head attention. Tokens communicate: "cat" picks up that it is the subject of "sat" (attention weight 0.32), and "sat" picks up that its actor is "cat" (attention weight 0.49). The result is added back via the residual connection.
Attention handles long-range dependencies: any two tokens can interact in a single step, regardless of distance. This is a key advantage of transformers over recurrent predecessors, where information has to pass through every intermediate position.
Attention is the conversation step. Before moving on, every token gets to check in with every other token and update its understanding of the sentence.
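The "check in with every other token" step can be sketched as single-head scaled dot-product attention. This is a simplified illustration under several assumptions: the token vectors are made up, the queries, keys, and values are the raw token vectors (real models apply learned projections and use multiple heads), and the resulting weights will not match the 0.32 and 0.49 figures quoted in this lesson.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much this token attends to each token
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["the", "cat", "sat"]
# Hypothetical 4-dimensional token representations.
x = [[1.0, 0.0, 0.0, 0.5],
     [0.0, 1.0, 0.5, 0.0],
     [0.0, 0.8, 1.0, 0.0]]

attended, weights = attention(x, x, x)
# Residual add: each token keeps its own vector and gains context from the others.
x_after = [[xi + ai for xi, ai in zip(row, att)] for row, att in zip(x, attended)]

for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row is one token's attention distribution over the whole sequence, so every row sums to 1.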
"cat" attends to "sat" (0.32) — picking up the action. "sat" attends to "cat" (0.49) — picking up the actor. After the residual add, both tokens carry context from each other.
The second sublayer is a Feed-Forward Network applied to each token independently. It expands the representation 4× (4096 → 16384), applies a non-linearity, then contracts back (16384 → 4096).
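The arithmetic behind the 4× expansion is worth spelling out. The sketch below assumes the FFN consists of exactly two weight matrices (one up-projection, one down-projection) and ignores bias terms; the variable names are illustrative.

```python
d_model = 4096
d_hidden = 4 * d_model           # 16384

up_weights = d_model * d_hidden    # 4096 -> 16384
down_weights = d_hidden * d_model  # 16384 -> 4096
total = up_weights + down_weights

print(d_hidden)  # 16384
print(total)     # 134217728
```

Under these assumptions, one FFN sublayer at this width holds roughly 134 million weights.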
Interpretability research has identified individual FFN neurons that activate for specific facts, and an association such as "Eiffel Tower → Paris" can be traced largely to specific FFN weights. In this view, the FFN acts as the model's fact store.
If attention is where tokens talk to each other, the FFN is where each token thinks alone — processing everything it has learned and applying its stored knowledge.
4096 input → 16384 hidden (4× expansion creates room for complex transformations) → non-linear activation → 4096 output. No cross-token interaction — pure per-token computation.
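The expand, activate, contract sequence can be sketched at toy scale. The code below assumes a width of 4 instead of 4096, random weights instead of learned ones, and ReLU as the non-linearity (the lesson does not say which activation is used); `ffn` and `matvec` are hypothetical helper names.

```python
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, v) for v in vec]

def ffn(token, w_up, w_down):
    hidden = relu(matvec(w_up, token))  # expand: d_model -> 4 * d_model, then activate
    return matvec(w_down, hidden)       # contract: 4 * d_model -> d_model

random.seed(0)
d_model, d_hidden = 4, 16  # toy sizes with the same 4x ratio
w_up = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
w_down = [[random.gauss(0, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]

sequence = [[1.0, 0.0, 0.5, -1.0], [0.2, 0.3, -0.4, 0.9]]
out = [ffn(tok, w_up, w_down) for tok in sequence]

# Replace the second token: the first token's output does not change,
# because the FFN never looks across positions.
changed = [sequence[0], [9.0, 9.0, 9.0, 9.0]]
out_changed = [ffn(tok, w_up, w_down) for tok in changed]
print(out_changed[0] == out[0])
```

The same weights are applied to every position, and the final comparison shows the per-token independence described above.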
One transformer block: Attention (communication) → residual → FFN (computation) → residual. The representation leaves richer than it arrived. The original signal is always preserved in the residual path.
The alternation of global (attention) and local (FFN) operations has proven to be a highly effective structure for language modelling. Most major LLMs use this pattern or a close variant of it.
Every block: first communicate, then compute. Talk then think. Do that 32 times and by the end, each token deeply understands its role in the sentence and the world.
Input "cat" → attention reads "sat" context → residual adds it → FFN applies stored knowledge → residual adds that → block done. 31 more blocks follow, each refining further.
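The wiring of one block can be sketched independently of what the sublayers compute. In the code below, `uniform_attention` and `toy_ffn` are hypothetical stand-ins chosen only to be cross-token and per-token respectively; they are not real attention or a real FFN. The sketch follows the simplified description in this lesson and omits the normalization layers that production models typically include.

```python
import math

def add(a, b):
    """Elementwise sum of two sequences of vectors."""
    return [[ai + bi for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, attention_fn, ffn_fn):
    x = add(x, attention_fn(x))             # communicate across tokens, then residual add
    x = add(x, [ffn_fn(tok) for tok in x])  # compute per token, then residual add
    return x

def uniform_attention(x):
    """Stand-in for attention: every token receives the average of all tokens."""
    n = len(x)
    mean = [sum(tok[j] for tok in x) / n for j in range(len(x[0]))]
    return [list(mean) for _ in x]

def toy_ffn(tok):
    """Stand-in for the FFN: a small non-linear update computed from one token only."""
    return [0.1 * math.tanh(v) for v in tok]

x = [[1.0, 0.0], [0.0, 1.0]]
y = transformer_block(x, uniform_attention, toy_ffn)
print(y)
```

Passing the sublayers in as functions keeps the order explicit: the FFN sees the representation after the attention update has been added, not the block's original input.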
Stacking 32 identical blocks — with different learned weights — produces hierarchical abstraction. Early layers handle tokens and syntax. Middle layers handle meaning and entities. Late layers handle reasoning and knowledge.
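"Identical blocks with different learned weights" amounts to one function applied in a loop with a separate parameter set per layer. The sketch below is a toy under loud assumptions: each layer's parameters are just two random gains rather than learned matrices, and `block` uses simple stand-ins for attention and the FFN.

```python
import random

NUM_LAYERS = 32

def block(x, params):
    """Toy block: same structure at every layer, behaviour set by its own params."""
    mix_gain, ffn_gain = params
    n, d = len(x), len(x[0])
    mean = [sum(tok[j] for tok in x) / n for j in range(d)]
    # Stand-in for attention + residual: add a weighted cross-token average.
    x = [[v + mix_gain * m for v, m in zip(tok, mean)] for tok in x]
    # Stand-in for FFN + residual: add a per-token non-linear update.
    x = [[v + ffn_gain * max(0.0, v) for v in tok] for tok in x]
    return x

rng = random.Random(0)
# One parameter set per layer (random here; learned in a real model).
layers = [(rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)) for _ in range(NUM_LAYERS)]

x = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [x]  # representation at the input and after every layer
for params in layers:
    x = block(x, params)
    hidden_states.append(x)

print(len(hidden_states))  # 33: the input plus one state per layer
```

Keeping every intermediate state, as `hidden_states` does here, is what allows the representation at a given depth to be inspected separately.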
Depth is a large part of what gives large models their qualitative leap over small ones. More layers mean more rounds of communication and computation, which yield richer and more abstract representations.
Like reading a sentence once for the words, again for the grammar, again for the meaning, again for the implication. Each pass builds on everything that came before.
A typical probing pattern: a probe on layer 1 predicts part-of-speech well, a probe on layer 12 predicts named entity type, and a probe on layer 28 retrieves the correct factual answer. Abstraction accumulates with depth.