June 26, 2026

Small compiler, huge comment energy

A Tiny Compiler for Data-Parallel Kernels

A 180-line code toy wowed nerds, but the comments turned into a compiler cage match

TLDR: A developer built a tiny Python tool that rewrites simple loops so computers can process several values at once, making plain code faster in theory. Commenters loved the learning exercise but instantly argued over whether it was a clever demo, an oversimplification, or basically a mini version of existing tools.

A developer showed off a tiny homemade compiler—about 180 lines of Python—that takes an ordinary-looking loop and rewrites it so a computer can chew through several pieces of data at once. In plain English: it’s a small experiment in turning boring, one-by-one code into code that can do the same job in parallel, with extra safety checks so it doesn’t run past the end of a list. Cute, clever, and very much a "look what I built to understand this" project.
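
For the curious, here is a rough sketch of what "several values at once, without running past the end" can look like. It is not the author's code: the function names, the add-one loop body, and the lane count of 4 are all made up for illustration, and the "lanes" are simulated with plain Python lists rather than real vector instructions.

```python
LANES = 4  # assumed vector width; the original post's lane count is not stated here


def scalar_add_one(src):
    """Reference version: the ordinary one-element-at-a-time loop."""
    out = [0] * len(src)
    for i in range(len(src)):
        out[i] = src[i] + 1
    return out


def vector_add_one(src, lanes=LANES):
    """Same loop, stepped `lanes` elements at a time with a bounds mask."""
    n = len(src)
    out = [0] * n
    for base in range(0, n, lanes):
        # One index per lane for this chunk.
        idx = [base + lane for lane in range(lanes)]
        # Mask is False for lanes that would fall past the end of the list.
        mask = [i < n for i in idx]
        # Masked load: inactive lanes get a placeholder instead of an out-of-bounds read.
        vals = [src[i] if m else 0 for i, m in zip(idx, mask)]
        # The loop body, applied to every lane.
        vals = [v + 1 for v in vals]
        # Masked store: only active lanes write their result back.
        for i, m, v in zip(idx, mask, vals):
            if m:
                out[i] = v
    return out
```

With a 10-element list and 4 lanes, the last chunk covers indices 8 through 11, and the mask switches off the two lanes that point past the end. That is the "extra safety check" in miniature.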

But the real fireworks were in the comments, where the community immediately split into "nice educational demo" and "okay, but real compilers do way more than this" camps. RossBencina came in with the classic expert-energy reality check, basically saying: hold on, professional vectorizing tools typically analyze loop-carried dependencies (cases where one iteration relies on the result of an earlier one) and aren’t limited to the easy cases shown here. Translation for non-compiler mortals: the demo is neat, but the grown-up version of this problem is much messier.

Then came the helpful flexes. One commenter pointed to Eigen, a popular C++ linear algebra library, as if to say, "welcome to the rabbit hole," while also joking that the reward is outrageously verbose compiler output. And then there was the funniest drive-by summary of all: "essentially this is a mini-ISPC?" (ISPC is Intel's compiler for exactly this style of data-parallel programming.) That one-liner landed like a meme, instantly reframing the whole post as either a charming tiny reinvention or a pocket-sized tribute act. In other words: everybody agreed it was cool, but the comments made sure nobody got away without a debate.

Key Points

  • The article describes a tiny compiler written in about 180 lines of Python to study how regular loops are lowered into explicit data-parallel form.
  • The compiler transforms scalar `for` loops into `vector_for` loops that execute multiple iterations together using lanes.
  • The lowered representation uses per-lane indices and masks, with masked loads and stores, to handle cases where the loop count is not divisible by the number of lanes.
  • The article distinguishes between uniform values shared by all lanes and varying values that differ per lane.
  • A small AST-walking classifier is presented as the core mechanism for determining how expressions should be handled during lowering.
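
To make the uniform-versus-varying point concrete, here is a minimal sketch of an AST-walking classifier built on Python's standard `ast` module. It is an illustration under assumptions, not the article's implementation: the names `classify` and `classify_source` are hypothetical, and the rule used (an expression is varying if it mentions any per-lane name, otherwise uniform) is the simplest one that fits the description above.

```python
import ast

UNIFORM, VARYING = "uniform", "varying"


def classify(node, varying_names):
    """Return VARYING if the expression depends on any per-lane name."""
    if isinstance(node, ast.Name):
        # A bare name varies only if it is a known per-lane value,
        # such as the loop index.
        return VARYING if node.id in varying_names else UNIFORM
    # Everything else (constants, operators, subscripts, calls) takes
    # its kind from its children: one varying child makes it varying.
    kinds = [classify(child, varying_names)
             for child in ast.iter_child_nodes(node)]
    return VARYING if VARYING in kinds else UNIFORM


def classify_source(expr, varying_names):
    """Parse a single expression and classify it."""
    tree = ast.parse(expr, mode="eval")
    return classify(tree.body, varying_names)


if __name__ == "__main__":
    lane_values = {"i"}  # assumption: only the loop index differs per lane
    for text in ["a[i] * scale + 1", "scale * 2", "n - 1"]:
        print(text, "->", classify_source(text, lane_values))
```

In a lowering pass, a result like this would decide whether an expression is computed once and shared across lanes or computed separately in each lane. A real classifier would also need to track variables assigned from varying expressions, which this sketch leaves out.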

Hottest takes

"vectorisation algorithms typically analyze loop-carried dependencies" — RossBencina
"Though you can expect very verbose compiler..." — rapatel0
"essentially this is a mini-ISPC?" — mgaunard