Neural Download
Installing mental model for compilers.
You type x equals one plus two. To you, that's a simple assignment. To your computer, it's just a string of characters — x, space, equals, space, one, space, plus, space, two. Raw text. No meaning at all.
The compiler's first job is to make sense of these characters. And it does that with a lexer — sometimes called a tokenizer.
The lexer reads your source code character by character, and groups them into tokens. Think of it like a sorting machine on a conveyor belt. Characters flow in, and out come labeled chunks.
The x becomes a token of type IDENTIFIER. The equals sign becomes ASSIGNMENT. The one becomes INTEGER LITERAL. Plus becomes OPERATOR. Two becomes another INTEGER LITERAL.
Whitespace? Comments? Gone. The lexer strips them out because they mean nothing to the computer. They're for humans only.
What you're left with is a clean stream of tokens — the vocabulary of your program. But here's the thing: the lexer doesn't understand what these tokens mean together. It knows x is an identifier and one is a number. But it has no idea that x equals one plus two is an assignment statement. That meaning — the structure — comes next.
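That sorting machine fits in a few lines. Here's a minimal sketch in Python, assuming made-up token names (IDENTIFIER, ASSIGN, and so on) and a tiny expression language; a real lexer handles keywords, strings, comments, and error reporting too.

```python
import re

# Patterns tried in order: INTEGER before IDENTIFIER so "1" is a number.
# Unknown characters are silently skipped in this sketch.
TOKEN_SPEC = [
    ("INTEGER",    r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN",     r"="),
    ("OPERATOR",   r"[+\-*/]"),
    ("SKIP",       r"\s+"),  # whitespace is matched, then thrown away
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Group a string of raw characters into labeled (type, text) tokens."""
    tokens = []
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":      # strip whitespace: humans-only
            tokens.append((match.lastgroup, match.group()))
    return tokens

# tokenize("x = 1 + 2") →
# [('IDENTIFIER', 'x'), ('ASSIGN', '='), ('INTEGER', '1'),
#  ('OPERATOR', '+'), ('INTEGER', '2')]
```

Nine characters in, five labeled chunks out, and the spaces never make it past the conveyor belt.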
The lexer gave us a flat list of tokens. But code isn't flat — it has structure. One plus two has to happen before the result gets assigned to x. There are rules about what can follow what, what's nested inside what. The parser's job is to discover that structure.
It reads the token stream and builds a tree. Not just any tree — the Abstract Syntax Tree. The AST.
Watch. The parser sees: identifier x, assignment, integer one, operator plus, integer two. And it builds this.
At the top sits an assignment node. Its left child is x, the target. Its right child isn't a single value. It's an addition node. And THAT node has two children: one and two.
The tree captures exactly how the expression is structured. One plus two is grouped together because addition has to happen first. Then the result flows up to the assignment. The tree IS the meaning of your code.
And here's where it gets powerful. What about something like x equals one plus two times three? Without the tree, that's ambiguous. But the parser knows that multiplication binds tighter than addition. So it builds the tree with times at the bottom — two times three happens first, then plus one. Operator precedence isn't a rule you memorize. It's baked into the shape of the tree.
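You can see precedence getting baked into the tree in a sketch of a recursive-descent parser. The node shapes here (nested tuples) are illustrative inventions; the trick is that the function for plus calls the function for times, so multiplication always ends up deeper in the tree.

```python
def parse(tokens):
    """Build an AST from a (type, text) token list for `name = expr`."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return tok

    def parse_atom():
        kind, text = advance()
        return ("num", int(text)) if kind == "INTEGER" else ("var", text)

    def parse_term():                      # * and / bind tighter
        node = parse_atom()
        while peek() in (("OPERATOR", "*"), ("OPERATOR", "/")):
            _, op = advance()
            node = (op, node, parse_atom())
        return node

    def parse_expr():                      # + and - bind looser
        node = parse_term()
        while peek() in (("OPERATOR", "+"), ("OPERATOR", "-")):
            _, op = advance()
            node = (op, node, parse_term())
        return node

    target = advance()[1]                  # IDENTIFIER (no error checks: sketch)
    advance()                              # ASSIGN
    return ("assign", target, parse_expr())

tokens = [("IDENTIFIER", "x"), ("ASSIGN", "="), ("INTEGER", "1"),
          ("OPERATOR", "+"), ("INTEGER", "2"), ("OPERATOR", "*"),
          ("INTEGER", "3")]
# parse(tokens) puts times at the bottom, plus above it:
# ('assign', 'x', ('+', ('num', 1), ('*', ('num', 2), ('num', 3))))
```

No precedence table in sight. The call structure of the parser IS the precedence rule.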
Every program you've ever written becomes one of these trees. Your entire codebase — every function, every loop, every condition — is a tree of trees.
The parser built us a beautiful tree. But the tree only captures structure — it doesn't check if the code actually makes sense.
Consider this: x equals hello plus three. The parser will happily build a tree for that. Assignment node, addition node, string, integer. Structurally, it's fine. But you can't add a string to an integer. That's semantic nonsense.
This is where the semantic analyzer walks the tree and asks the hard questions. Does this variable exist? Has it been declared? When you call a function with three arguments, does that function actually take three? When you add two things, are they types that can be added?
It walks the tree node by node. First: scope resolution. It builds a symbol table — a map of every variable and function, where they're declared, and where they're visible. If you use a variable that doesn't exist, this is where the compiler catches it.
Next: type checking. The analyzer looks at every operation and checks if the types are compatible. Integer plus integer? Fine. String plus integer? Error. And some languages go further — the analyzer can infer types you didn't even write, filling in the blanks automatically.
By the end, the tree is annotated — every node stamped with its type, every variable linked to its declaration. The tree now isn't just structure. It's verified structure. And that matters, because the next stage is going to transform it — and it needs to trust that everything is correct.
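Both jobs, scope resolution and type checking, can be sketched as one tree walk. Everything here is hypothetical: tuple-shaped nodes, a plain dict as the symbol table, and a toy rule that only integers can be added.

```python
def check(node, symbols):
    """Walk the tree, record declarations, and return each node's type."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "str"
    if kind == "var":                                  # scope resolution
        name = node[1]
        if name not in symbols:
            raise NameError(f"undeclared variable {name!r}")
        return symbols[name]
    if kind == "+":                                    # type checking
        left = check(node[1], symbols)
        right = check(node[2], symbols)
        if left != "int" or right != "int":
            raise TypeError(f"cannot add {left} and {right}")
        return "int"
    if kind == "assign":                               # declare + record type
        symbols[node[1]] = check(node[2], symbols)
        return symbols[node[1]]

symbols = {}
check(("assign", "x", ("+", ("num", 1), ("num", 2))), symbols)   # fine: "int"
# check(("+", ("str", "hello"), ("num", 3)), symbols)
# raises TypeError: cannot add str and int
```

The parser happily built a tree for hello plus three. This walk is the stage that refuses it.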
The front end understood your code. The back end's job is to make it fast. And this is where compilers become genuinely magical.
The verified tree gets lowered into an intermediate representation — IR. Think of it as a simplified, machine-neutral version of your code. Not quite assembly, not quite source code. A middle language.
And now the optimizer goes to work.
First: constant folding. If your code says x equals two plus three, the optimizer doesn't wait until runtime. It computes two plus three right now, at compile time, and replaces the whole expression with five. The addition just vanishes.
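A constant-folding pass is just a bottom-up tree rewrite. This sketch assumes the same invented tuple-shaped nodes as before; real compilers fold over their IR, not the source AST, but the idea is identical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Replace any subtree whose operands are known with its computed value."""
    if node[0] in OPS:
        left, right = fold(node[1]), fold(node[2])
        if left[0] == "num" and right[0] == "num":
            # Both operands known at compile time: do the math now.
            return ("num", OPS[node[0]](left[1], right[1]))
        return (node[0], left, right)
    if node[0] == "assign":
        return ("assign", node[1], fold(node[2]))
    return node

# x = 2 + 3 collapses before the program ever runs:
# fold(("assign", "x", ("+", ("num", 2), ("num", 3))))
# → ("assign", "x", ("num", 5))
```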
Next: dead code elimination. If there's an if-false block — code that can never run — the optimizer doesn't just skip it. It deletes it entirely. Gone. The binary gets smaller and the CPU never even sees it.
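Deleting an if-false block looks something like this sketch over a list of statements. The statement shapes are made up for illustration; a production pass works on control-flow graphs and reachability, not literal if nodes.

```python
def eliminate_dead(stmts):
    """Drop statements that can never run; flatten branches that always run."""
    alive = []
    for stmt in stmts:
        if stmt[0] == "if" and stmt[1] == ("bool", False):
            continue                                  # can never run: delete it
        if stmt[0] == "if" and stmt[1] == ("bool", True):
            alive.extend(eliminate_dead(stmt[2]))     # always runs: keep body only
            continue
        alive.append(stmt)
    return alive

program = [
    ("assign", "x", ("num", 1)),
    ("if", ("bool", False), [("call", "debug_dump")]),   # hypothetical dead block
    ("assign", "y", ("num", 2)),
]
# eliminate_dead(program) →
# [('assign', 'x', ('num', 1)), ('assign', 'y', ('num', 2))]
```

Note how this pass feeds on the previous one: once constant folding turns a condition into a literal false, elimination can see that the branch is dead.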
Then: function inlining. If you call a tiny function a million times, the overhead of the call itself — pushing arguments, jumping, returning — can cost more than the actual computation. So the optimizer copies the function's body directly into the call site. No more call overhead. The function dissolves into the code that uses it.
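Inlining, at its core, is copy-paste with substitution. In this sketch, all names and node shapes are inventions: a function is a (parameters, body) pair, and inlining replaces the call with the body, with each parameter swapped for the caller's argument.

```python
def substitute(node, env):
    """Replace parameter variables with the caller's argument expressions."""
    if node[0] == "var" and node[1] in env:
        return env[node[1]]
    if node[0] in ("+", "-", "*"):
        return (node[0], substitute(node[1], env), substitute(node[2], env))
    return node

def inline(expr, functions):
    """Dissolve a call node into a copy of the callee's body."""
    if expr[0] == "call":
        name, args = expr[1], expr[2]
        params, body = functions[name]
        env = dict(zip(params, args))
        return substitute(body, env)       # no push, no jump, no return
    return expr

# square(n) = n * n. The call site x = square(4) becomes x = 4 * 4:
functions = {"square": (["n"], ("*", ("var", "n"), ("var", "n")))}
# inline(("call", "square", [("num", 4)]), functions)
# → ("*", ("num", 4), ("num", 4))
```

And once the body is inlined, constant folding can collapse four times four to sixteen. The passes compound.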
Each of these passes runs over the IR, transforms it, and passes the result to the next optimization. Some compilers run dozens of these passes. LLVM — the backend used by Clang, Rust, and Swift — has over two hundred optimization passes.
The code that comes out of this stage is often radically different from what you wrote. Shorter. Faster. But semantically identical. Same behavior, fraction of the cost.
The IR is optimized. Now comes the final translation — turning those abstract operations into actual machine instructions that the CPU can execute. This is code generation.
Remember our x equals one plus two? In the IR, it's something like: load the value one into a temporary, load two into another temporary, add them, store the result. Clean, abstract operations.
The code generator maps each of these to real CPU instructions. Load becomes MOV — move the value one into register R-one. The second load becomes MOV into R-two. Add becomes ADD R-one comma R-two — the CPU physically adds the values in those registers. Store becomes MOV from the register into memory at x's address.
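That mapping can be sketched as a table lookup over a tiny three-address IR. The IR format, register names, and textual output here are simplified inventions; a real code generator emits binary encodings, not strings, and targets a specific instruction set.

```python
def codegen(ir):
    """Map abstract IR operations to textual x86-style instructions."""
    asm = []
    for op, *args in ir:
        if op == "load":                        # load  reg, constant
            asm.append(f"MOV {args[0]}, {args[1]}")
        elif op == "add":                       # add   dest_reg, src_reg
            asm.append(f"ADD {args[0]}, {args[1]}")
        elif op == "store":                     # store variable <- reg
            asm.append(f"MOV [{args[0]}], {args[1]}")
    return asm

# The IR for x = 1 + 2: two loads, an add, a store.
ir = [("load", "R1", 1), ("load", "R2", 2),
      ("add", "R1", "R2"), ("store", "x", "R1")]
# codegen(ir) →
#   MOV R1, 1
#   MOV R2, 2
#   ADD R1, R2
#   MOV [x], R1
```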
But it's not just a one-to-one translation. The code generator has to make hard decisions. Which registers to use? There are only a handful — maybe sixteen general-purpose registers on your CPU. If your program uses a hundred variables, the code generator has to decide which ones live in fast registers and which get spilled to slower memory. This is register allocation — one of the hardest problems in compiler design.
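A toy version of the register-versus-memory decision: hand out registers in order, and once they run out, spill the remaining variables to stack slots. This is deliberately naive. Real allocators use liveness analysis and graph coloring to decide who spills, but the split it illustrates is the real one.

```python
def allocate(variables, num_registers=4):
    """Assign each variable a register, or a stack slot once registers run out."""
    placement = {}
    for i, var in enumerate(variables):
        if i < num_registers:
            placement[var] = f"R{i + 1}"                       # fast register
        else:
            placement[var] = f"[stack+{(i - num_registers) * 8}]"  # spilled
    return placement

# Six variables, four registers: the last two get spilled to slower memory.
# allocate(["a", "b", "c", "d", "e", "f"]) →
# {'a': 'R1', 'b': 'R2', 'c': 'R3', 'd': 'R4',
#  'e': '[stack+0]', 'f': '[stack+8]'}
```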
And at the very end, those instructions get encoded into binary — the actual ones and zeros. Machine code. The thing the CPU reads.
So here's the full picture. You typed x equals one plus two. The lexer chopped it into tokens. The parser built a tree. The semantic analyzer checked the types. The tree was lowered to IR. The optimizer folded the constants and eliminated the waste. And the code generator turned it into machine instructions.
Six stages. One line of code. And every program you run went through all of them.
Cognitive architecture... updated.
