AMD GPUs Go Brrr

Stanford drops “HipKittens” to make AMD scream, but fans say the real fix is money, not memes

TLDR: Stanford released HipKittens, a toolkit to squeeze real speed from AMD GPUs, showing the hardware can fly if handled right. The crowd’s verdict: cool research, but the real battle is AMD’s weak software—many demand the company spend big to rival NVIDIA’s polished tools, memes and all.

AMD’s latest chips look like muscle cars, but the internet says they’re still stuck in first gear because of software. Stanford’s Hazy Research just dropped HipKittens — a grab bag of low‑level tricks to make AMD GPUs go fast — with an eye‑catching title: “AMD GPUs go brrr.” The specs are wild (more memory, lots of tiny compute parts), and the team shares the code and an arXiv paper so devs can try it. But the comments? Oh, they’re here for the drama, not the benchmarks.

The top vibe: “Great work, academia — but this is AMD’s job.” One camp cheers the effort and links the HN thread, while the loudest voices roast AMD for not fixing its own software. “Spend billions already!” cries one user, echoing a chorus that says NVIDIA dominates because its tools just work. Another adds that running AI at home is smoother than dealing with data center setups, which stings. Meanwhile, the meme police show up, delighted that a serious lab put “go brr” on a Stanford page. The split is clear: clever research vs. corporate responsibility. The community wants AMD to unlock the power — and open its wallet.

Key Points

  • Hazy Research introduces HipKittens, a set of programming primitives to unlock AMD GPU performance for AI workloads.
  • The post details the AMD MI355X architecture: 256 compute units (CUs), four SIMDs per CU, and 64-thread waves, with emphasis on keeping the matrix cores fed.
  • AMD lacks several NVIDIA features (async matmul instructions, register reallocation, tensor memory acceleration, first-class mbarrier) but offers more CUs and a larger register file per processor.
  • AMD’s chiplet design splits the 256 CUs across eight XCDs (accelerator complex dies), each with a private L2 cache, plus an additional last-level cache (LLC) between L2 and HBM.
  • Spec comparison shows MI355X leading B200 in several precision modes and memory capacity, with equal memory bandwidth.
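
To make the layout numbers above concrete, here is a rough back-of-the-envelope sketch in Python. It only does arithmetic on the figures quoted in the key points (256 CUs, four SIMDs per CU, 64-thread waves, eight XCDs); the `GpuLayout` class and its field names are made up for illustration and are not part of HipKittens or any AMD tool. It also assumes the CUs are split evenly across the XCDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    """Hypothetical container for the MI355X figures quoted in the key points."""
    compute_units: int = 256    # total CUs on the chip
    simds_per_cu: int = 4       # SIMD units inside each CU
    threads_per_wave: int = 64  # threads in one wave
    xcds: int = 8               # chiplets, each with a private L2

    @property
    def cus_per_xcd(self) -> int:
        # Assumes an even split of CUs across XCDs.
        return self.compute_units // self.xcds

    @property
    def total_simds(self) -> int:
        return self.compute_units * self.simds_per_cu

    @property
    def threads_with_one_wave_per_simd(self) -> int:
        # Illustrative only: one resident wave on every SIMD.
        return self.total_simds * self.threads_per_wave


layout = GpuLayout()
print("CUs per XCD:", layout.cus_per_xcd)
print("Total SIMDs:", layout.total_simds)
print("Threads with one wave per SIMD:", layout.threads_with_one_wave_per_simd)
```

The takeaway is the scale: on these numbers each XCD holds 32 CUs behind its own L2, so a kernel has to keep over a thousand SIMDs busy across eight separate caches, which is the "feeding the matrix cores" problem the post is about.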

Hottest takes

"It's insane to me that AMD is not spending billions and billions trying to fix their software" — colordrops
"I don't get why AMD doesn't solve their own software issues" — DeathArrow
"tickled to see 'go brr' on a website/university like Stanford" — alex1138