January 31, 2026
M4 math mayhem, comment chaos
Demystifying ARM SME to Optimize General Matrix Multiplications
Apple M4 math boost has fans cheering, skeptics yelling ‘BLIS or it didn’t happen’
TLDR: An open-source library speeds up matrix math on Apple’s M4 by about 23% using ARM’s new matrix hardware. Excited fans call it a step away from Nvidia’s grip, while skeptics demand BLIS comparisons and flag odd vector slowdowns—turning a solid win into a lively benchmark brawl.
Move over, math class—MpGEMM just squeezed extra speed out of Apple’s M4 by tapping ARM’s new matrix hardware, SME. The headline: about 1.23x faster than Apple’s own Accelerate library on real AI jobs like DeepSeek and LLaMA. Impressive? Yes. But the comments instantly turned into a jury trial.
The loudest charge: show us BLIS. As bee_rider notes, the paper skipped comparing against BLIS, a popular high-performance BLAS library, and that has folks side‑eyeing the benchmark table. Is MpGEMM a genuine leap or a carefully staged flex? Meanwhile, anematode adds nuance: SME slaps for pure matrix math, but Apple's streaming SVE mode is slow for other vector workloads—so there's still performance weirdness lurking under the hood.
Then come the curveballs and memes. Archit3ch asks if it supports sparse LU solves (the nerdy equivalent of “does it run Doom?”), while starkeeper declares this the sword to slay the “nvidia monster” and reclaim precious DRAM from GPU land. Translation: big speedup, bigger debate. The crowd is split between “finally, CPU renaissance!” and “cool, but where’s the BLIS showdown?”—making MpGEMM the latest lightning rod where math benchmarks, silicon quirks, and platform wars collide.
Key Points
- MpGEMM is an open-source GEMM library optimized for ARM's Scalable Matrix Extension (SME).
- The design follows guidelines from a systematic characterization of SME's architectural features.
- Key techniques include cache-aware partitioning, on-the-fly transposition during data packing, and SME-focused micro-kernels using multi-vector loads and tile registers.
- MpGEMM supports multiple numerical precisions to suit varied workloads.
- On an Apple M4 Pro, MpGEMM averages 1.23× faster than Apple's Accelerate library and outperforms other open-source alternatives on DeepSeek and LLaMA workloads.
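To make the first three techniques concrete, here is a minimal, portable C sketch of a cache-blocked GEMM with on-the-fly transposition during packing. This is not MpGEMM's code: the tile sizes are placeholders, and the scalar `micro_kernel` stands in for what would, on real hardware, be an SME kernel using ZA tile registers and multi-vector loads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical block sizes; real values would be tuned to the M4's
// cache hierarchy and SME tile geometry (not taken from the paper).
#define MC 64   // rows of A processed per block
#define KC 64   // shared-dimension block
#define NC 64   // columns of B processed per block

// Pack a k x n block of B, transposing on the fly so the micro-kernel
// reads each output column of C with unit stride. This illustrates
// "on-the-fly transposition during data packing" in spirit only.
static void pack_B(int k, int n, const float *B, int ldb, float *Bp) {
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            Bp[j * k + p] = B[p * ldb + j];  // column j becomes contiguous
}

// Scalar stand-in for an SME micro-kernel: C_block += A_block * B_packed.
// Real SME code would accumulate outer products into ZA tile registers.
static void micro_kernel(int m, int n, int k,
                         const float *A, int lda,
                         const float *Bp, float *C, int ldc) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i * lda + p] * Bp[j * k + p];
            C[i * ldc + j] += acc;
        }
}

// Cache-aware GEMM: C (MxN) += A (MxK) * B (KxN), all row-major.
// Blocks are chosen so the packed B panel stays resident in cache.
void gemm_blocked(int M, int N, int K,
                  const float *A, const float *B, float *C) {
    float *Bp = malloc(sizeof(float) * KC * NC);
    for (int jc = 0; jc < N; jc += NC) {
        int nc = N - jc < NC ? N - jc : NC;
        for (int pc = 0; pc < K; pc += KC) {
            int kc = K - pc < KC ? K - pc : KC;
            pack_B(kc, nc, &B[pc * N + jc], N, Bp);
            for (int ic = 0; ic < M; ic += MC) {
                int mc = M - ic < MC ? M - ic : MC;
                micro_kernel(mc, nc, kc, &A[ic * K + pc], K,
                             Bp, &C[ic * N + jc], N);
            }
        }
    }
    free(Bp);
}
```

The design point the sketch shows: packing pays a one-time copy/transpose cost so the hot inner loop streams memory contiguously, which is exactly where SME's multi-vector loads and tile registers would replace the scalar loop.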