July 3, 2026
Chip fight night 🍿
GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell
AMD’s cheaper AI chips just crashed Nvidia’s party — and the comments got spicy
TLDR: Wafer says AMD’s MI355X can run a major AI model at about 80% of Nvidia Blackwell’s speed for less than half the cost, which matters because Nvidia hardware is pricey and scarce. The comments instantly turned into a fight over missing power-use data, benchmark spin, and whether AMD is finally becoming a real threat.
A hardware brag post turned into a full-on comment-section cage match after Wafer claimed it got GLM5.2 running on AMD chips at 2626 tokens per second per node and more than 2x cheaper than Nvidia Blackwell. In plain English: they’re saying AMD can do a very similar AI job for a lot less money, at a time when Nvidia gear is expensive, hard to find, and treated like concert tickets for data centers. The crowd immediately split into two camps: “huge win for AMD” and “okay, but show us the real numbers.”
The most practical voices weren’t even arguing brand loyalty — they wanted boring grown-up details like electricity use, supply, and profit margins. One commenter basically said, “Cool chart, now show performance per watt,” especially for companies outside the U.S. that can’t get enough Nvidia chips anyway. Another sniffed around the business model and asked if providers are secretly running 80%+ margins or if poor usage rates are eating the profits. Translation: the audience smelled a bigger industry story hiding behind the benchmark victory lap.
And then came the nerd drama. One commenter slammed the headline vibes by pointing out that the flashy ~2,600 tok/s number is aggregate throughput across every request a node is serving at once, not the speed a single user sees. Another dropped a galaxy-brain hot take that Blackwell is already old news because Rubin is supposedly the real inference king. Meanwhile, one of the more optimistic comments sounded like a movie trailer for AI coding agents: they could unlock all the ignored chip architectures engineers never had time to optimize. So yes, this was a performance post, but the comments turned it into a referendum on hype, honesty, power bills, and whether Nvidia’s moat is finally looking a little less scary.
Key Points
- Wafer reports serving GLM5.2 on AMD Instinct MI355X at 2,626 aggregate tokens per second per node on a 20k input / 1k output workload with a 60% cache hit rate and up to 2.4 requests per second.
- The article says this MI355X result was about 80% of the performance measured on an NVIDIA B200 at less than half the cost.
- Wafer also reports 213 tokens per second for a single stream on a 10k input / 1.5k output benchmark following Artificial Analysis standards, using TensorWave MI355X capacity.
- For model preparation, Wafer quantized bf16 GLM-5.2 to MXFP4 with AMD Quark and compared it with z-ai’s FP8 quantization using GSM8K, GPQA-Diamond, and tau2 benchmarks.
- Wafer selected SGLang over vLLM and ATOM, then fixed ROCm/MTP quantization issues related to bf16 shared experts and layer-name mismatches to enable speculative decoding support.
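
To make the headline trade-off concrete, here is a rough back-of-the-envelope sketch of what "about 80% of the performance at less than half the cost" means in tokens per dollar. The function name is made up for illustration, and the 0.5 cost ratio is an assumed boundary case: the article only says "more than 2x" cheaper and does not give actual prices, so the true figure would be somewhat higher than what this prints.

```python
def relative_tokens_per_dollar(relative_throughput: float, relative_cost: float) -> float:
    """Tokens-per-dollar of one system relative to a baseline system.

    relative_throughput: its throughput as a fraction of the baseline (0.80 = 80%).
    relative_cost: its cost as a fraction of the baseline (0.50 = half the price).
    """
    if relative_throughput <= 0 or relative_cost <= 0:
        raise ValueError("ratios must be positive")
    return relative_throughput / relative_cost


# Figures from the key points: roughly 80% of B200 performance.
# ASSUMPTION: 0.50 is the boundary of "more than 2x cheaper"; real prices are not given.
ratio = relative_tokens_per_dollar(relative_throughput=0.80, relative_cost=0.50)
print(f"MI355X tokens per dollar vs. B200: at least ~{ratio:.1f}x")

# Break-even check: at 80% of the throughput, the cheaper chip only wins on
# tokens per dollar if it costs less than 80% of the baseline.
break_even = relative_tokens_per_dollar(relative_throughput=0.80, relative_cost=0.80)
```

Note that this only covers purchase or rental cost per token. It says nothing about performance per watt, which is exactly the number commenters were asking for.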