HipKittens: Fast and furious AMD kernels

AMD’s comeback kitten claws at Nvidia as commenters hiss, purr, and panic

TLDR: HipKittens debuts fast AMD GPU kernels meant to unlock the hardware's performance. Comments split between hope (georgehotz: AMD finally works) and doubt (“never misses an opportunity to miss an opportunity”), plus crash horror stories and quibbles about the missing Nvidia B300 comparison, all signs that the Nvidia-vs-AMD balance could actually shift.

HipKittens, a new set of super-fast AMD GPU kernels from Stanford’s HazyResearch, just tossed red meat into the GPU cage fight. The pitch: AMD’s latest chips look scary-powerful on paper, but clunky software keeps that muscle locked up. HipKittens promises speed and simpler building blocks, calling out flaky libraries, compiler slowdowns, and even “undocumented” hardware quirks. Translation: they want AMD cards to stop feeling like second-class citizens. The crowd noticed. Some are ready to ditch green for red; others rolled their eyes and stocked popcorn.

Enter the comments, claws out. One pragmatist snapped that “AMD never misses an opportunity to miss an opportunity,” demanding that AMD actually fund efforts like HipKittens and the paper. A Gentoo veteran shared battle scars about AMD’s “composable-kernel” nuking their PC with out-of-memory crashes: peak Linux drama. Then celebrity hacker georgehotz pulled up to say things are finally working: install PyTorch with ROCm (AMD’s toolkit) and go. “The MI350X machine is stable.” Meanwhile, finance brains questioned whether Nvidia’s sky-high value still makes sense if fewer AI models dominate, and a drive-by heckler complained the team “totally ignored B300.” The vibe: cautious hype, spicy skepticism, and way too many cat jokes. HipKittens is selling speed; the comments are selling a reality check and a redemption arc.

Key Points

  • HipKittens is released as state-of-the-art AMD GPU kernels with programming primitives to ease AMD kernel development.
  • AMD MI355X OAM shows strong hardware specs versus NVIDIA B200 SXM5 (higher compute in several precisions, larger memory capacity, equal bandwidth).
  • Existing AMD software (CK, AITER, PyTorch, compilers) often fails to reach peak performance; SDPA Llama GQA backwards achieves only 30% (AITER) and 24% (PyTorch) of SoTA on MI355X.
  • Undocumented hardware behavior around bank conflict avoidance in the CDNA ISA is identified as a key issue.
  • Compiler observations: Mojo’s MHA forwards hits ~50% of peak due to bank conflicts; TileLang is limited to CDNA3 and its MHA kernel is only competitive with PyTorch.
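
For readers unfamiliar with the term, the bank conflicts mentioned in the last two points can be illustrated with the generic textbook model of banked on-chip memory. The sketch below is only that model: the 32 banks, 4-byte bank width, 32 lanes, and the helper names `conflict_degree` and `column_addresses` are all illustrative assumptions, not the CDNA behavior the HipKittens authors describe (which, per the summary above, is undocumented).

```python
# Toy model of shared-memory bank conflicts (textbook version, NOT AMD-specific).
NUM_BANKS = 32   # assumption: illustrative bank count
BANK_WIDTH = 4   # assumption: bytes per bank word


def conflict_degree(byte_addrs):
    """Largest number of distinct words that land in one bank.

    1 means conflict-free; N means the access is serialized N ways
    in this simplified model.
    """
    words_per_bank = {}
    for addr in byte_addrs:
        word = addr // BANK_WIDTH
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


def column_addresses(row_stride_elems, col=0, lanes=32, elem_bytes=4):
    """Byte addresses touched when each lane reads one element of a column
    from a row-major tile whose rows are row_stride_elems elements apart."""
    return [(lane * row_stride_elems + col) * elem_bytes for lane in range(lanes)]


# Row stride of 32 floats: every lane hits the same bank.
unpadded = conflict_degree(column_addresses(32))
# Padding each row by one element spreads the lanes across all banks.
padded = conflict_degree(column_addresses(33))
print(unpadded, padded)
```

In this toy model, padding the row stride from 32 to 33 elements removes the conflict entirely. Real hardware can follow different rules, which is the gap the key points above are flagging.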

Hottest takes

"AMD never misses an opportunity to miss an opportunity" — wewewedxfgdf
"Things are turning around for AMD... The MI350X machine is stable" — georgehotz
"unrecoverable OOMs in my Gentoo system" — LtdJorge