Neural Download

NumPy: How Python Gets C Speed

NumPy isn't really Python. It's a thin Python layer over a C core. A Python for-loop summing 100 million numbers takes about 8 seconds. `np.arange(100_000_000).sum()` finishes in a tenth of a second — 80× faster, using the same Python you just wrote. Where did the speed come from? Python didn't suddenly get fast. The loop moved into C.

→ Why a Python list is a pointer chase and an ndarray is a wall of bytes
→ What a ufunc actually is, and why `arr + arr` becomes ONE C function call, not a million
→ How NumPy ships hand-tuned SIMD kernels (SSE, AVX, NEON) for every modern CPU
→ The stride trick: why `arr[::2]` doesn't copy — and why broadcasting is literally `stride = 0`
→ The "Python mask over C" pattern that powers pandas, PyTorch, scikit-learn

By the end, that one-line `.sum()` is the least mysterious thing in your data science stack.

Previous video (recommended first): "Why Python Is 100x Slower Than C"

Chapters:
0:00 Intro
0:02 The puzzle — one line, 80× faster
0:38 Bytes, not objects — the memory layout trick
1:33 Ufuncs — one call, one C loop
2:37 Strides — slicing without copying
3:30 The Python mask over C
4:19 Outro

References:
→ Harris et al., "Array programming with NumPy" (Nature, 2020) — https://www.nature.com/articles/s41586-020-2649-2
→ Jake VanderPlas, "Losing Your Loops" (PyCon 2015) — https://youtu.be/EEUXKG97YRw
→ NumPy internals documentation — https://numpy.org/doc/stable/reference/internals.html

#Python #NumPy #DataScience


One line of Python. One C loop underneath.

Summing 100 million numbers in a Python for-loop takes about 8 seconds. np.arange(100_000_000).sum() does the same work in a tenth of a second. Same Python syntax. 80× faster.

Python didn't suddenly get fast. The loop moved.
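You can reproduce the race yourself. A minimal benchmark sketch — the exact times and the 80× figure depend entirely on your machine and Python build, and `N` is scaled down here to 10 million so the pure-Python loop finishes quickly:

```python
import time
import numpy as np

N = 10_000_000  # 10 million; scale up to 100 million to reproduce the video

# Pure Python: the interpreter round-trips once per element.
t0 = time.perf_counter()
total = 0
for i in range(N):
    total += i
py_time = time.perf_counter() - t0

# NumPy: one Python call, one compiled C loop underneath.
t0 = time.perf_counter()
np_total = int(np.arange(N, dtype=np.int64).sum())
np_time = time.perf_counter() - t0

assert total == np_total == N * (N - 1) // 2
print(f"python: {py_time:.2f}s  numpy: {np_time:.3f}s  "
      f"speedup: {py_time / np_time:.0f}x")
```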

Bytes, not objects

Start with a Python list of the numbers 1, 2, 3.

You'd think those numbers live in the list. They don't. The list holds pointers — little arrows that point somewhere else on the heap. Follow an arrow and you land on a full Python integer object: a reference count, a type tag, and finally the actual digits. Twenty-eight bytes for the number 3, on CPython 3.11+.

A million numbers? A million tiny objects. Scattered across the heap. Every element access is a pointer chase.

The NumPy array doesn't play that game. No pointers. No objects. No type tags. Just bytes.

Python list [1, 2, 3]  →  [ptr][ptr][ptr]  →  heap: [PyLong:1] [PyLong:2] [PyLong:3]
NumPy array  [1, 2, 3]  →  [01][02][03]   ← 24 bytes, contiguous, done

Three eight-byte integers. Twenty-four bytes, end to end. Ask for the tenth element — jump ten slots, read eight bytes, done. Ask for the millionth — still one jump. No chasing.

The array isn't a list of Python things. It's a block of raw memory, with a label on top.
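The gap between objects and bytes is easy to inspect from the REPL. The 28-byte figure is a CPython implementation detail on 64-bit builds and may differ on your platform:

```python
import sys
import numpy as np

# One small Python int: refcount + type pointer + digit payload.
# 28 bytes on a typical 64-bit CPython (an implementation detail).
print(sys.getsizeof(3))

# The list on top stores pointers to those objects, not the numbers.
lst = [1, 2, 3]

arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.itemsize)   # 8: bytes per element
print(arr.nbytes)     # 24: the entire payload, contiguous
print(arr.tobytes())  # the raw bytes themselves
```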

Ufuncs — one call, one C loop

Now the trick.

a + b, in Python syntax, looks like one operation. It is. But that one operation has to touch a million elements. Or a billion.

A Python for-loop would round-trip through the interpreter once per number. That's why it's slow.

NumPy doesn't do that. The + operator on an ndarray dispatches to a ufunc — a universal function. A ufunc is a compiled C function. It gets handed two things: the byte blocks, and a count.

for (npy_intp i = 0; i < n; i++)
    out[i] = a[i] + b[i];

That's the loop. It runs in C, for the entire array, in one function call. The Python interpreter sees one operation. The CPU runs a million adds.

And inside that C loop, there's SIMD. Vector instructions NumPy ships hand-tuned for every modern CPU — SSE, AVX, NEON. One cycle, four adds. Sometimes eight. Sometimes sixteen.

That's where the speed lives. Every arithmetic op, every comparison, every math function in NumPy — it's a ufunc. One call, one C loop, done.
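The dispatch is visible from Python: `+` on ndarrays resolves to `np.add`, which is a `np.ufunc` object, and reductions like `.sum()` ride the same compiled machinery. A quick sketch:

```python
import numpy as np

a = np.arange(5, dtype=np.int64)   # [0 1 2 3 4]
b = np.ones(5, dtype=np.int64)

# The + operator on ndarrays dispatches to the np.add ufunc:
# one Python-level call, one C loop over all five elements.
assert type(np.add) is np.ufunc
assert (np.add(a, b) == a + b).all()

# Ufuncs carry reductions built on the same compiled loop:
print(np.add.reduce(a))   # 10, same as a.sum()
print(np.multiply(a, a))  # elementwise squares, another ufunc
```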

Strides — slicing without copying

Slice a NumPy array. Take every other element: arr[::2]. What got copied?

Nothing.

What you got back looks like an array. It has a shape. A dtype. But it's pointing at the same bytes. It just reads them differently.

That's what strides are. A stride says: to reach the next element, skip this many bytes. A normal array of eight-byte integers has a stride of 8. Element, element, element. A stride of 16? Skip every other one. Same memory, different walk.
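A small check confirms the slice is a view, not a copy. Because it shares memory with the original, writes through the slice land in the original array:

```python
import numpy as np

arr = np.arange(6, dtype=np.int64)   # strides (8,): next element is 8 bytes away
view = arr[::2]                      # every other element: [0 2 4]

print(view.strides)       # (16,): skip 16 bytes to reach the next element
print(view.base is arr)   # True: the slice points at arr's own memory

# No bytes were copied, so writing through the view changes arr.
view[0] = 99
print(arr[0])             # 99
```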

Transpose a matrix? Bytes don't move. The strides just swap.

>>> arr = np.array([[1,2,3],[4,5,6]], dtype=np.int64)
>>> arr.strides
(24, 8)
>>> arr.T.strides
(8, 24)     # same bytes. different walk.

Broadcasting uses the same trick. Adding a row to a whole matrix — it looks like the row got copied to every row below. It didn't. Broadcasting sets a stride of zero. Stride zero means: don't advance. Read the same bytes, again and again.

One block of bytes. Many ways to walk it. Still one C loop at the bottom.

The Python mask over C

This is the pattern.

NumPy is a thin Python layer. Syntax for shapes, arithmetic, slicing — all Python-looking. Underneath: a block of bytes, and a table of compiled C functions. You write in Python. The CPU runs in C.

Pandas works the same way — a dataframe is an ndarray with labels on top. Every operation drops into C. PyTorch tensors follow the same playbook: a block of bytes, compiled kernels, C or CUDA underneath. Scikit-learn models wrap NumPy arrays with C kernels on top.

This is why Python won scientific computing. It never had to be fast. The loops Python can't run, Python doesn't run. It hands the bytes to C, and waits.

One line of Python. One C loop underneath. That's the trick.