March 20, 2026
GPUs go brrr—or do they?
Mamba-3
Faster AI or just fancy words? Commenters split as Mamba-3 touts speed without the jargon
TLDR: Mamba-3 promises quicker replies than Mamba-2 and a small Transformer, and the team even released the GPU code. Commenters split three ways: plain‑English fans cheering the speed, skeptics arguing batching cancels the gains, and benchmarkers eager to test, setting up a latency-vs-throughput showdown everyone cares about.
Mamba-3 shows up yelling one thing: make it fast to run. The team says their new design—smarter “memory” updates, math that tracks complex signals, and a multi-lane input/output trick—lets a 1.5B model beat both its Mamba-2 predecessor and even a tiny Transformer on response time. They even open‑sourced the GPU code. Translation: the same or better answers, delivered quicker.
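The "memory updates" here are the state-space recurrence at the heart of Mamba-style models: the hidden state decays a little and mixes in the new token each step, so decoding is one cheap update per token instead of re-reading the whole context. A minimal, illustrative sketch of a diagonal linear recurrence follows; this is not the actual Mamba-3 kernel (the team's real kernels are in the open-sourced repo), just the basic mechanic:

```python
import numpy as np

def ssm_decode_step(h, x_t, A, B, C):
    """One decode step of a diagonal linear state-space recurrence.

    h   : (d_state,) current hidden state (the "memory")
    x_t : scalar     new input feature for this token
    A   : (d_state,) per-channel decay (how fast memory fades)
    B   : (d_state,) input projection
    C   : (d_state,) output projection
    """
    h = A * h + B * x_t   # update memory: decay old state, mix in new input
    y_t = C @ h           # read a scalar output off the state
    return h, y_t

# Toy usage: run a short sequence one token at a time, decode-style.
d_state = 4
rng = np.random.default_rng(0)
A = np.full(d_state, 0.9)          # stable decay < 1 keeps the state bounded
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)

h = np.zeros(d_state)
for x_t in [1.0, 0.5, -0.25]:
    h, y_t = ssm_decode_step(h, x_t, A, B, C)
```

The per-step cost is constant in sequence length, which is exactly why these models pitch themselves on inference speed; Mamba-3's changes (a more expressive recurrence, complex-valued states, MIMO) enrich what `A`, `B`, and `C` can do without breaking that property.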
But the comments are the real show. One top voice, robofanatic, begs for plain English and fewer buzzwords, essentially asking, “just say it runs faster.” Another camp, led by jychang, throws a red flag: more number‑crunching per token only helps when you’re serving one user at a time; “no provider does batch=1 inference,” they argue. That sets off the classic brawl: latency lovers vs throughput chasers.
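The skeptics' argument is a roofline claim: at batch size 1, decode is memory-bound (the step time is dominated by streaming weights and state from memory), so extra arithmetic per token is nearly free; at provider-scale batch sizes, decode becomes compute-bound and the extra math slows every request. A toy model with made-up numbers (not real hardware specs or Mamba-3 measurements) shows the shape of the argument:

```python
# Toy roofline model of the latency-vs-throughput debate. All numbers are
# illustrative assumptions, not benchmarks.
def decode_step_time(batch, flops_per_token, bytes_per_step,
                     peak_flops=1e14, mem_bw=1e12):
    """Time for one batched decode step: the GPU is limited by whichever
    is slower, moving the weights (memory) or doing the math (compute).
    Simplification: weight traffic is read once per step and shared
    across the batch."""
    compute_time = batch * flops_per_token / peak_flops
    memory_time = bytes_per_step / mem_bw
    return max(compute_time, memory_time)

base = dict(flops_per_token=2e9, bytes_per_step=1e9)
fancy = dict(flops_per_token=6e9, bytes_per_step=1e9)  # 3x math, same memory traffic

# batch=1: both designs are memory-bound, so tripling the math costs ~nothing.
t1_base = decode_step_time(1, **base)
t1_fancy = decode_step_time(1, **fancy)

# batch=256: compute-bound, so the extra math now taxes throughput.
t256_base = decode_step_time(256, **base)
t256_fancy = decode_step_time(256, **fancy)
```

Under these toy numbers, `t1_base` and `t1_fancy` come out identical while `t256_fancy` is roughly 3x `t256_base`, which is the "no provider does batch=1 inference" objection in miniature. Whether real serving lands in the memory-bound or compute-bound regime is exactly what the benchmarkers want to measure.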
Meanwhile, nl brings popcorn and benchmarks, saying they’re “looking forward to comparing this to Inception 2”—yes, even text diffusion models are entering the chat. The memes rolled in too: “GPUs go brrr” vs “GPUs go blur” if you’re just shuffling memory.
Bottom line? Fans cheer a speed bump for real‑world use, skeptics say the math may not matter at provider scale, and the rest want fewer acronyms and more receipts.
Key Points
- Mamba-3 is an inference-first state space model introducing a more expressive recurrence, complex-valued state tracking, and a MIMO variant.
- At 1.5B parameters, Mamba-3 SISO achieves lower combined prefill+decode latency than Mamba-2, Gated DeltaNet, and Llama-3.2-1B across sequence lengths.
- The team open-sourced high-performance kernels built with Triton, TileLang, and CuTe DSL.
- The work is a collaboration among Carnegie Mellon University, Princeton University, Cartesia AI, and Together AI, cross-posted on the Goomba Lab blog.
- The article frames a shift from training-focused to inference-centric design, citing RLVR and agentic workflows as drivers of inference demand.