May 11, 2026
Fast code, hotter comments
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
Swift tries to outrun C, and the comments are absolutely eating it up
TLDR: A developer hand-tuned Swift code to speed up AI training on Apple hardware, aiming for eye-popping gains without relying on big software libraries. Readers loved the ambition, argued over what counts as “real” speed, and instantly turned the comment section into a nerdy showdown over benchmarks, hidden hardware, and risky compiler tricks.
A Swift developer decided to do something delightfully chaotic: build the number-crunching heart of an AI model by hand, skip the usual machine-learning toolkits, and then try to make it faster than C, the old-school language many programmers treat like sacred scripture. The project is about speeding up matrix multiplication — basically the repeated arithmetic that powers AI training — on Apple chips, from the regular processor to the graphics hardware. But the real party is in the replies, where readers are treating this like a comeback tour, a hardware gossip thread, and a tiny programming flame war all at once.
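The "repeated arithmetic" in question is matrix multiplication, which at its simplest is just three nested loops of multiply-adds. A minimal, unoptimized sketch in Swift (the function name, row-major flat-array layout, and dimension parameters here are illustrative assumptions, not the article's actual API):

```swift
// Naive C = A × B for an m×k matrix A and a k×n matrix B,
// stored row-major as flat arrays. This is the baseline the
// article's hand-tuned kernels are built to beat.
func matmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]  // one multiply + one add
            }
            c[i * n + j] = sum
        }
    }
    return c
}
```

Getting from this triple loop to Tflop/s territory is the whole game: the inner loop as written is memory-bound and leaves the chip's SIMD and cache hierarchy almost entirely idle.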
The biggest mood? Respect mixed with nerdy disbelief. One commenter called the piece "phenomenal," while others loved seeing serious Swift performance writing in a world where that kind of deep-dive is rare. There's also a big nostalgia wave: longtime readers were thrilled to see Cocoa with Love creator Matt Gallagher still dropping dense, high-quality posts like it's a greatest-hits album.

But of course, this is the internet, so the applause quickly turned into debate. One crowd jumped on the author's claim of about 1.1 trillion calculations per second and basically said, "Cute number, but real graphics-chip speed is way messier than the marketing stickers suggest." Another commenter zoomed in on a compiler setting and delivered the classic programmer scolding: yes, AI math can get away with looser rules, but please do not normalize dangerous shortcuts for everyone else. Even the semi-secret Apple math hardware got dragged into the drama, with commenters wondering whether the company is hiding special tricks in plain sight. In other words: one man optimized Swift, and the community turned it into a full-on performance, purity, and prestige discourse.
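For context on what the benchmark crowd was arguing about: the headline number comes from the standard FLOP count for matrix multiplication. An m×k by k×n multiply performs 2·m·n·k floating-point operations (one multiply and one add per inner-loop step), so the achieved rate is that count divided by elapsed time. A sketch of the arithmetic (this is the conventional accounting, not the author's actual benchmark harness, and the example sizes are made up):

```swift
// Conventional matmul throughput accounting: an m×k by k×n multiply
// does 2·m·n·k floating-point operations. Rate = ops / seconds.
func gflops(m: Int, n: Int, k: Int, seconds: Double) -> Double {
    let flop = 2.0 * Double(m) * Double(n) * Double(k)
    return flop / seconds / 1e9
}

// Hypothetical example: a 4096×4096×4096 multiply finishing in 0.125 s
// sustains 2 · 4096³ / 0.125 ≈ 1.1e12 flop/s, i.e. ~1.1 Tflop/s.
let rate = gflops(m: 4096, n: 4096, k: 4096, seconds: 0.125)
```

The skeptics' point is that this measured rate depends heavily on matrix size, precision, and memory layout, so a single number rarely tells you how close you actually are to the hardware's peak.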
Key Points
- The article focuses on optimizing handwritten matrix multiplication in Swift for LLM training on Apple Silicon.
- It is the first part of a series about training neural networks in Swift and comparing Apple Silicon compute units such as CPU, SIMD, AMX, and GPU.
- The project deliberately avoids machine learning libraries and frameworks, even though the author acknowledges established frameworks are the practical choice for production matrix multiplication.
- The sample app is intended to use the matrix kernels inside a full LLM implementation, with performance measured over complete forward and backward training iterations.
- The Swift implementation is based on a rewrite of Andrej Karpathy's llm.c, a plain C GPT-2-compatible reference implementation.