How Transformers See Every Word at Once
Self-attention, query-key-value, feed-forward networks, and generation — the full transformer architecture, animated step by step.
Neural Download
Installing a mental model for the transformer.
Every time you type into ChatGPT, a machine reads your entire message and writes back, word by word. That machine is called a transformer. And right now, it's the most important architecture in all of AI.
But to understand why it works, you need to see what came before — and why it broke.
Before transformers, language models worked like a game of telephone. Each word got whispered to the next. By word fifty, the original message was a ghost. Information decayed with every step.
Worse — it was painfully slow. Each word had to wait for the previous one. A thousand words meant a thousand sequential steps, while thousands of GPU cores sat idle.
In 2017, a team at Google asked: what if we threw away the chain entirely? What if every word could talk to every other word, all at once?
Not a whisper chain. A group chat. That idea changed everything.
Here's how the group chat works. It starts with a simple idea: turn each word into numbers — a vector. Direction means meaning. Similar words point in similar directions.
And these vectors carry real relationships. King minus man, plus woman, lands near queen. The geometry of meaning.
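To make the geometry concrete, here's a toy sketch with hand-picked 3-dimensional vectors chosen so the analogy holds exactly (real models learn hundreds or thousands of dimensions from data):

```python
import numpy as np

# Toy 3-d embeddings, hand-chosen so the analogy works exactly.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

def cosine(a, b):
    # Similar direction -> cosine close to 1.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

analogy = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(analogy, emb[w]))
print(nearest)  # queen
```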
Now — here's where self-attention begins. Each word creates three versions of itself. A query — what am I looking for? A key — what do I have to offer? And a value — here's my actual information. Each comes from its own trainable projection — three matrices, learned from data.
The query of one word dot-products with the key of every other word. High score means strong connection. Low score means irrelevant. You now have a compatibility score between every pair of words.
But there's a problem. As the key dimension grows, so does the variance of the dot products. Softmax would then give nearly all attention to a single word. So we scale — divide by the square root of the key dimension. A volume knob. Turn it, and the distribution smooths out. Multiple words contribute instead of just one.
After scaling, softmax converts scores into weights. Each row sums to one. A probability distribution over every word in the sentence.
Finally — multiply those weights by the values. Each word gets a weighted sum of every other word's information. Words that matter contribute more. Words that don't fade out.
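The whole mechanism — projections, scaled scores, softmax, weighted sum — fits in a few lines. A minimal sketch, with random matrices standing in for the trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 8            # 5 words, embedding size 8

X = rng.normal(size=(n, d_model))    # one embedding per word

# The three learned projections (random stand-ins; training sets them).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)      # every pair's compatibility, volume-knobbed
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)   # softmax: each row sums to one

out = weights @ V                    # each word: a weighted mix of all values
```

The `weights` matrix is exactly the attention map described next: one row per word, one column per word it could attend to.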
And when you zoom out? For a single head — an attention map. Every word's relationship to every other word, visible all at once. The cat sat on the mat — watch how sat lights up connections to cat and mat. This is the transformer seeing language.
But one attention pattern isn't enough. Language has many kinds of relationships happening at once. Subject and verb. Adjective and noun. A pronoun and the thing it refers to.
So the transformer doesn't run one attention — it runs several in parallel. In the original paper, eight heads, each learning to spot a different pattern. One might track grammar. Another tracks meaning. Another finds long-range references.
Each head learns its own projections from the full embedding. All heads run simultaneously, then their results get concatenated and projected back together. Multiple perspectives, merged into one.
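A minimal multi-head sketch — each head gets its own (here random) projections in a smaller subspace, then everything is concatenated and projected back. Real implementations batch the heads into one matrix multiply rather than looping:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(1)
n, d_model, heads = 4, 16, 8
d_head = d_model // heads                      # each head works in a smaller space

W_q = rng.normal(size=(heads, d_model, d_head))
W_k = rng.normal(size=(heads, d_model, d_head))
W_v = rng.normal(size=(heads, d_model, d_head))
W_o = rng.normal(size=(d_model, d_model))      # the output projection

X = rng.normal(size=(n, d_model))

outputs = []
for h in range(heads):                         # conceptually parallel
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    A = softmax(Q @ K.T / np.sqrt(d_head))     # this head's own attention map
    outputs.append(A @ V)

# Concatenate all heads, then project back to the model dimension.
out = np.concatenate(outputs, axis=-1) @ W_o
```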
But attention is only half the story. And this is the part most explanations skip.
After attention, every token passes through a feed-forward network. This is the hidden giant of the transformer — often holding the majority of a model's parameters. Attention decides what to look at. The feed-forward network processes what you found.
Think of it as a private consultation. Each token enters a room, queries a memory bank with thousands of slots, and leaves enriched with knowledge. Research suggests this is one of the key places the model stores learned facts.
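The layer itself is simple: expand, apply a nonlinearity, project back down. A sketch using the ReLU activation and 4× expansion from the original paper, with random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32                     # the paper expands by 4x

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    hidden = np.maximum(0, x @ W1 + b1)   # expand into the wide layer + ReLU
    return hidden @ W2 + b2               # project back down

tokens = rng.normal(size=(5, d_model))    # 5 tokens, each consulted privately
out = feed_forward(tokens)                # same shape back, per token
```

Note the size: W1 and W2 together hold 2 × d_model × d_ff weights — eight d_model² per layer at the 4× ratio, versus four d_model² for the attention projections. That's why these layers dominate the parameter count.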
Now, wrapping both of these — attention and the feed-forward network — are residual connections. This is the design choice that makes everything else possible.
Every block doesn't replace the input. It adds to it. The original signal flows on a highway that runs through every layer. Each layer is an on-ramp — it contributes new information, but the original is preserved.
This means gradients during training flow backward on that same highway, staying strong across dozens of layers. The residual stream is why transformers can go deep without the signal collapsing.
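In code, the highway is literally one addition. A sketch of the pre-norm arrangement common in modern models (the original paper normalized after the addition instead):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def residual(x, sublayer):
    # The sublayer ADDS to the input instead of replacing it,
    # so the original signal rides through untouched.
    return x + sublayer(layer_norm(x))

x = np.ones((3, 4))
# If a sublayer contributes nothing, the input passes through unchanged:
y = residual(x, lambda z: np.zeros_like(z))
assert np.allclose(y, x)
```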
There's one more thing attention can't do on its own. It treats words as a set — no sense of order. "Dog bites man" and "man bites dog" look the same.
The fix: before anything else, positional information gets added to each embedding. The original paper used overlapping sine and cosine waves at many frequencies — each position gets a unique signature. Modern models often learn these positions directly, or encode them inside attention itself. Either way, the model knows where each word sits.
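The sinusoidal scheme from the original paper can be sketched directly — each pair of dimensions oscillates at its own frequency, so every position gets a distinct fingerprint:

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]            # (n, 1) positions
    i = np.arange(d_model // 2)[None, :]             # (1, d/2) dimension pairs
    freq = 1.0 / (10000 ** (2 * i / d_model))        # one frequency per pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freq)                 # even dims: sine
    pe[:, 1::2] = np.cos(pos * freq)                 # odd dims: cosine
    return pe

pe = sinusoidal_positions(50, 16)
# These get ADDED to the word embeddings before the first layer.
```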
Now let's put it all together.
A sentence enters the transformer. Each word becomes an embedding, gets its positional information, then flows through a stack of identical blocks — attention, normalize, feed-forward, normalize — picking up more understanding at every layer.
At the end, the final vector predicts the next word.
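All the pieces above can be strung into a toy forward pass. Everything here is illustrative — random weights, single-head attention, tiny sizes, a simplified normalization placement — but the data flow matches the description: embed, add positions, stack blocks, score the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, d_model, n_layers = 10, 8, 2          # tiny toy sizes

E = rng.normal(size=(vocab, d_model))        # embedding table
P = rng.normal(size=(64, d_model)) * 0.1     # toy stand-in for learned positions

def ln(x):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def make_layer(s=0.3):
    return dict(
        Wq=rng.normal(size=(d_model, d_model)) * s,
        Wk=rng.normal(size=(d_model, d_model)) * s,
        Wv=rng.normal(size=(d_model, d_model)) * s,
        W1=rng.normal(size=(d_model, 4 * d_model)) * s,   # 4x expansion
        W2=rng.normal(size=(4 * d_model, d_model)) * s,
    )

def block(x, L):
    n = len(x)
    # Causal mask: each word may only attend to itself and earlier words.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    Q, K, V = x @ L["Wq"], x @ L["Wk"], x @ L["Wv"]
    x = x + softmax(Q @ K.T / np.sqrt(d_model) + mask) @ V   # attention + residual
    x = x + np.maximum(0, x @ L["W1"]) @ L["W2"]             # feed-forward + residual
    return ln(x)

layers = [make_layer() for _ in range(n_layers)]
tokens = [3, 1, 4, 1]
h = E[tokens] + P[:len(tokens)]      # embed + add positions
for L in layers:
    h = block(h, L)
logits = h[-1] @ E.T                 # last token's vector scores every word
next_token = int(np.argmax(logits))
```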
And this is where it gets real. This is what ChatGPT is doing right now, every time it writes a response.
It starts with your prompt. Runs it through every layer. Predicts the first word. Then takes your prompt plus that new word, and runs it again to predict the second. Then again. And again. Each new word can only see what came before it — that's autoregressive modeling.
But here's the clever part — it doesn't recompute everything from scratch. The keys and values from previous words get cached. Each new word runs through the layers, but looks up the cached keys and values instead of recomputing them. This is the KV cache — and it's what makes generation practical.
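A sketch of one attention layer's cache — illustrative rather than any particular library's API. Each new token computes its key and value once, appends them, and attends over everything stored so far:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K_cache, V_cache = [], []

def attend_new_token(x):
    """Attention for ONE new token: its key and value are computed once,
    cached, and every later step just reads them back."""
    q = x @ W_q
    K_cache.append(x @ W_k)          # computed once, never recomputed
    V_cache.append(x @ W_v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = softmax(q @ K.T / np.sqrt(d))
    return w @ V                     # the new token's attention output

# Generation loop: per-step cost grows with cache length, not with the
# square of everything generated so far.
for step in range(5):
    x = rng.normal(size=d)           # stand-in for the newest token's embedding
    out = attend_new_token(x)
```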
Token by token, the sentence builds. Each word informed by every word that came before it, through the exact same attention mechanism we just watched.
That's the transformer. The architecture behind nearly every large language model, and most AI tools you've used this year. Not magic — machinery. And now you've seen how it works.
Cognitive architecture... updated.
