CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL

AI-tuned code claims to beat NVIDIA’s best—commenters cry foul

TLDR: An AI-tuned system claims faster matrix math than NVIDIA’s go-to library on A100 GPUs, but commenters challenge the comparisons, the “we discovered it” branding, and a confusing speedup chart. Hype meets scrutiny as readers demand fair tests, clear visuals, and proof it works beyond one chip.

CUDA-L2 swaggered onto the stage claiming its LLM + reinforcement learning pipeline (think: trial-and-error bot) found faster ways to multiply big number grids, telling the world it beats NVIDIA's famed cuBLAS on an A100 GPU across 1,000 matrix sizes. The devs dropped A100-only, half-precision (FP16) kernels with 16-bit accumulators, with more GPUs and 32-bit accumulation "coming soon." But the internet didn't throw confetti; it threw questions.

Skeptics pounced on fairness: one top comment asks whether the system only supports FP16 while being compared against an FP32 (fuller-precision) solver, aka apples vs oranges. Others bristled at the paper's claim that the algorithm "discovered" new tricks. One commenter called it "literature laundering," suggesting the method may remix known techniques rather than invent fresh ones: citations, please. And the chart? Chaos. Readers expected raw performance numbers, but instead got "speedup over X," where bigger bars mean "faster than the baseline" and a bar at 0% means "just as fast," not "zero performance." Cue confusion and side-eye.
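The chart's semantics can be pinned down with one line of arithmetic. A minimal sketch (the `speedup_pct` helper is hypothetical, not from the project's code):

```python
def speedup_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Relative speedup of a candidate kernel over a baseline.

    0% means "same speed as the baseline"; 100% means "twice as fast".
    Any positive bar therefore already means "beats the baseline".
    """
    return (baseline_ms / candidate_ms - 1.0) * 100.0

# A kernel taking 0.8 ms against a 1.0 ms cuBLAS baseline:
print(speedup_pct(1.0, 0.8))  # ~25: a quarter faster, not "25% of cuBLAS"
# Same time as the baseline:
print(speedup_pct(1.0, 1.0))  # 0.0: as fast, not "zero performance"
```

Plotting this quantity instead of raw TFLOPS is what tripped readers up: the tallest bars are the biggest wins, and a flat bar is a tie, not a failure.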

Meanwhile, the practical crowd groaned at the setup: a pinned CUTLASS version, required environment variables, and silent failures when anything is off, the perfect recipe for a forum meltdown. A100-only support sparked more grumbles: will this help on 3090s or H100s? "Maybe, not guaranteed," says the FAQ. And in pure meme energy, one reply just posted "-4 -4 -4 -4 -4," instantly becoming a downvote-drumbeat joke. The mood: excitement meets skepticism, chart wars, and a whole lot of "show me the receipts."

Key Points

  • CUDA-L2 uses LLMs and reinforcement learning to optimize FP16 HGEMM CUDA kernels.
  • The project reports outperforming torch.matmul and NVIDIA cuBLAS/cuBLASLt variants across 1,000 (M,N,K) cases on A100.
  • As of Dec 2, 2025, A100-optimized HGEMM kernels for 1,000 configurations are released, currently with 16-bit accumulators.
  • Planned updates include 32-bit accumulators, denser configuration coverage, support for Ada Lovelace/Hopper/Blackwell GPUs, and easier deployment for open-source LLMs.
  • Setup requires PyTorch 2.6.0+, NVIDIA CUTLASS v4.2.1, and correctly set environment variables; an included evaluation script supports offline and server modes.
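The 16-bit-accumulator caveat is what fuels the FP16-vs-FP32 fairness complaint: an FP16 kernel that also accumulates in FP16 trades accuracy for speed relative to the cuBLAS-style default of FP32 accumulation. A minimal NumPy sketch (not the project's code; it simulates both accumulation strategies on CPU to show the gap):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 128
a = rng.standard_normal((64, K)).astype(np.float16)
b = rng.standard_normal((K, 64)).astype(np.float16)

# FP32 accumulation: FP16 inputs, but partial sums carried in float32
# (the baseline behavior commenters say cuBLAS is being compared with).
acc32 = a.astype(np.float32) @ b.astype(np.float32)

# FP16 accumulation: every partial sum is rounded back to float16,
# mimicking 16-bit accumulators via rank-1 updates.
acc16 = np.zeros((64, 64), dtype=np.float16)
for i in range(K):
    acc16 = (acc16 + np.outer(a[:, i], b[i, :])).astype(np.float16)

# Float64 reference to measure accumulation error alone
# (both paths see the exact same quantized FP16 inputs).
ref = a.astype(np.float64) @ b.astype(np.float64)
err32 = np.max(np.abs(acc32.astype(np.float64) - ref))
err16 = np.max(np.abs(acc16.astype(np.float64) - ref))
print(f"max error, FP32 accumulate: {err32:.6f}")
print(f"max error, FP16 accumulate: {err16:.6f}")  # noticeably larger
```

The FP16-accumulated result drifts measurably further from the reference, which is why "same precision mode on both sides" is the benchmark hygiene the commenters are asking for, and why the planned 32-bit-accumulator release matters.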

Hottest takes

"only support FP16... compares... against an FP32 solver?" — stonogo
"It smells like it could be 'laundering' the literature" — j2kun
"the bar chart effectively inverts the performance" — alyxya