Developing a BLAS Library for the AMD AI Engine [pdf]

AMD’s AI Engine gets roasted: slower than CPUs, “dead end” vibes

TLDR: A master’s thesis adds a math library to AMD’s AI Engine to make it easier to use, but early results sparked backlash over slower-than-CPU performance. The community split: some blame missing compiler optimizations, others say the hardware and tooling are a dead end, urging AMD to stick to GPUs.

A student thesis builds a math toolkit (BLAS) for AMD’s/Xilinx’s AI Engine—the specialized chip inside the VCK5000 FPGA board—so people can chain operations without learning esoteric hardware code. That’s the nice part. The comments? Pure theater. The loudest take: “It’s called an AI Engine but it’s slower than a CPU—so what’s the point?” Critics asked if this thing helps anyone beyond tiny models and power nerds. One hardware whisperer fired back: software pipelining (overlapping work in loops) was missing, which on this chip can deliver a 5–10x speedup, so hold the pitchforks—this might be a compiler/directives problem, not a doomed design.

Another voice tried to cool heads by noting, “Folks, this is BLAS on an FPGA board,” while skeptics escalated: AMD will ditch this and focus on GPUs, citing scattered toolchains and thin software support. The meme brigade jumped in with zingers like “AI stands for Actually Inefficient” and “Pipeline? More like pipe dream.” Fans of the thesis applauded the effort to make weird hardware usable; cynics called it a showcase of why open, mature tooling matters. The split: is the AI Engine hamstrung by bad tools—or just fundamentally not worth the hassle?
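For readers unfamiliar with the software-pipelining claim at the center of the debate, here is a rough sketch in plain Python (not AIE code; the function names are invented for illustration). The transformation splits a loop body into stages and overlaps stage 2 of iteration i with stage 1 of iteration i+1, using a prologue and epilogue to fill and drain the pipeline:

```python
def naive(xs):
    """Straight-line loop: each iteration runs all stages back to back."""
    out = []
    for x in xs:
        a = x + 1         # stage 1: "load/prepare"
        b = a * 2         # stage 2: "compute"
        out.append(b)     # stage 3: "store"
    return out

def pipelined(xs):
    """Software-pipelined schedule of the same loop.

    In the steady state, stage 2 of iteration i and stage 1 of
    iteration i+1 sit in the same loop trip, so a VLIW core (like an
    AIE tile) could issue them in the same cycle. On a sequential
    Python interpreter this gains nothing; it only shows the schedule.
    """
    out = []
    if not xs:
        return out
    a = xs[0] + 1          # prologue: stage 1 of iteration 0
    for x in xs[1:]:       # steady state: two overlapped stages per trip
        b = a * 2          # stage 2 of iteration i
        a = x + 1          # stage 1 of iteration i+1
        out.append(b)
    out.append(a * 2)      # epilogue: drain the final iteration
    return out

print(naive([1, 2, 3]))      # → [4, 6, 8]
print(pipelined([1, 2, 3]))  # → [4, 6, 8] (same result, overlapped schedule)
```

Both versions compute the same values; the payoff only appears on hardware that can issue the overlapped stages concurrently, which is where the quoted 5–10x figure would come from if the compiler emitted such a schedule.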

Key Points

  • The thesis introduces aieblas, a BLAS library for the AMD/Xilinx AI Engine (AIE) that compiles chained BLAS routines into full dataflow designs.
  • Spatial dataflow architectures are highlighted for resolving control at compile time, which reduces CPU-style control overhead, but they remain hard to program for HPC because of device-specific languages and limited high-level libraries.
  • The library features an expandable core enabling addition of new operations and optimizations, with principles applicable to other spatial dataflow architectures.
  • Design and implementation cover mapping BLAS routines to AIE kernels, user configuration, code and graph generation, kernel placement, PL kernel generation, and build/system configuration (including CMake).
  • Performance is evaluated against OpenBLAS and several integrated optimizations are assessed; the thesis concludes aieblas eases programming of AMD/Xilinx AI Engines without deep low-level model knowledge.
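The "chained BLAS routines compiled into a dataflow design" idea can be sketched as follows. This is a hypothetical Python model, not the actual aieblas API: the names (`Chain`, `scal`, `axpy`) are invented here, and Python lists stand in for the hardware streams that connect AIE kernels.

```python
# Hypothetical model of chaining BLAS routines as dataflow kernels.
# Names are invented for illustration; aieblas generates real AIE
# kernels and a compile-time graph instead of Python closures.

def scal(alpha):
    # BLAS Level 1 SCAL: out = alpha * x
    return lambda xs: [alpha * x for x in xs]

def axpy(alpha, ys):
    # BLAS Level 1 AXPY: out = alpha * x + y
    return lambda xs: [alpha * x + y for x, y in zip(xs, ys)]

class Chain:
    """Chains kernels so each one's output stream feeds the next."""
    def __init__(self, *kernels):
        self.kernels = kernels

    def __call__(self, xs):
        for kernel in self.kernels:
            xs = kernel(xs)
        return xs

# Chain two routines: first scale by 2, then add the vector [1, 1, 1].
pipeline = Chain(scal(2.0), axpy(1.0, [1.0, 1.0, 1.0]))
print(pipeline([1.0, 2.0, 3.0]))   # → [3.0, 5.0, 7.0]
```

In the thesis's design, each such node would be generated as an AIE kernel, the chain would become a dataflow graph fixed at compile time, and data would flow between kernels over streams rather than being materialized in memory between steps.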

Hottest takes

“It’s called an ‘AI Engine’, but it’s slower than CPU?” — kouteiheika
“With software pipelining, they could have 5–10x speed up” — titanix88
“This architecture is likely a dead end for AMD” — fooblaster
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.