CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Researchers say they found a clever speed trick, but the comments say “we’ve seen this movie”

TLDR: CODA claims it can speed up AI training by reorganizing work so less time is wasted moving data on and off the chip. Commenters weren’t fully dazzled: some said the trick isn’t new, and the real story is that this setup may be easier for AI tools to write.

A new paper called CODA says it can make training big artificial intelligence models faster by keeping more work on the GPU chip itself instead of constantly sending data back and forth to the card’s main memory. In plain English: the authors are trying to cut down on wasteful shuffling so the chip spends more time actually doing useful math. That sounds impressive, and some commenters were genuinely excited by one line in particular: large language models writing these fast kernels themselves. One reaction basically boiled down to this: if the bots can help build the speedups, progress might suddenly move a lot faster.

But the real action was in the comment section, where the mood was less “jaw dropped” and more “hang on, isn’t this just an old trick with a fresh label?” The sharpest pushback came from people saying CODA doesn’t unlock some magical new performance boost so much as package existing ideas in a way that’s friendlier for machine-generated code. That turned the conversation into a spicy little identity crisis: is this a breakthrough in speed, or a breakthrough in how we describe the speed trick so an AI can assemble it?

And yes, the nerd humor arrived right on schedule. One commenter joked that people seeing this paper were getting major “second kernel” energy, as if veteran chip programmers were watching a remake and recognizing every plot beat. Another user dropped a chaotic mini-summary so dense it read like the community’s version of a detective board covered in red string. In other words: the paper brought optimization news, but the comments delivered the drama, skepticism, and memes.

Key Points

  • The paper identifies memory-bound operators around dense linear algebra as a significant source of end-to-end runtime in Transformer training.
  • CODA is introduced as a GPU kernel abstraction that expresses these operators as GEMM-plus-epilogue programs.
  • The approach relies on algebraically reparameterizing Transformer operators that frameworks usually run as many separate kernels, so they can execute while GEMM output tiles remain on chip.
  • CODA fixes the GEMM mainloop and exposes composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation.
  • The paper reports high performance for both human-authored and LLM-authored CODA kernels across representative Transformer workloads.
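
For readers who want to see the shape of the idea, here is a minimal NumPy sketch of a "GEMM plus epilogue" program. It is not CODA's API: the names `scale_rows`, `add_residual`, and `gemm_with_epilogue` are hypothetical, and the RMSNorm example is our own assumption about the kind of algebraic rewrite the key points describe. The loop over output tiles stands in for the fixed mainloop, and the epilogue steps run on each tile before it is written out, which on a real GPU would mean before it leaves on-chip memory.

```python
import numpy as np

# Hypothetical epilogue primitives (illustrative names, not CODA's API).
def scale_rows(row_scale):
    """Multiply each row of the output tile by a per-row factor."""
    def op(tile, rows, cols):
        return tile * row_scale[rows, None]
    return op

def add_residual(residual):
    """Accumulate the matching slice of a residual tensor into the tile."""
    def op(tile, rows, cols):
        return tile + residual[rows, cols]
    return op

def gemm_with_epilogue(a, b, epilogue, tile=4):
    """Tiled matmul; every epilogue op runs on a tile before it is stored."""
    m, _ = a.shape
    _, n = b.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            rows, cols = slice(i, i + tile), slice(j, j + tile)
            acc = a[rows, :] @ b[:, cols]      # fixed mainloop: one output tile
            for op in epilogue:                # epilogue: tile is still "on chip"
                acc = op(acc, rows, cols)
            out[rows, cols] = acc              # single write of the final result
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))     # activations
g = rng.standard_normal(8)          # RMSNorm gain
w = rng.standard_normal((8, 10))    # projection weights
res = rng.standard_normal((6, 10))  # residual stream

inv_rms = 1.0 / np.sqrt(np.mean(x * x, axis=1) + 1e-6)

# Unfused reference: normalize, project, add residual as separate passes.
reference = ((x * inv_rms[:, None]) * g) @ w + res

# Reparameterized: fold the gain into the weights, then apply the per-row
# 1/rms factor and the residual add as epilogue steps on each output tile.
fused = gemm_with_epilogue(x, g[:, None] * w, [scale_rows(inv_rms), add_residual(res)])

print(np.allclose(fused, reference))
```

The rewrite works because scaling a row of the input by a constant scales the same row of the output by that constant, so the normalization can move from before the matmul to after it. Whether this matches the specific rewrites in the paper is an assumption; the sketch only shows why such a move is algebraically legal.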

Hottest takes

"doesn’t enable any performance that Triton couldn’t already achieve" — rahen
"LLMs can successfully author CODA kernels" — maxignol
"Getting a lot of \"GEMM epilogue fusion\" vibes from this" — saagarjha