March 11, 2026
Sleepless bots, restless comments
AutoKernel: Autoresearch for GPU Kernels
Overnight GPU speedups? The internet screams “Google did it first” while benchmark cops swarm
TLDR: AutoKernel lets an AI agent tune GPU code overnight for faster models, promising plug‑and‑play speedups. Commenters are split: some say Google already did this, others demand comparisons to TVM and claim NVIDIA’s CUTLASS is faster, while pragmatists want it aimed at llama.cpp for real-world wins.
AutoKernel promises a dream: give it any PyTorch model, go to bed, and wake up to faster GPU code. It uses an autonomous agent to tweak tiny math programs (“kernels”) in Triton, tests each change for correctness, and repeats all night. The devs boast roughly 40 experiments an hour—320 overnight—with an orchestrator applying “Amdahl’s law” to pick the next win. The repo is live at AutoKernel, and fans love the plug‑and‑play vibe.
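The "Amdahl's law" bit is standard performance math: the overall win from speeding up one kernel is capped by how big a slice of total runtime that kernel owns, so the orchestrator chases the biggest slices first. A minimal sketch of the formula (the function name and the 40% example are illustrative, not from the repo):

```python
# Amdahl's law: if a kernel accounts for fraction f of total runtime
# and gets a local speedup s, the overall model speedup is:
#     overall = 1 / ((1 - f) + f / s)

def overall_speedup(f: float, s: float) -> float:
    """Overall speedup from accelerating one kernel.

    f: fraction of total runtime spent in that kernel (0..1)
    s: local speedup achieved on that kernel (>= 1)
    """
    return 1.0 / ((1.0 - f) + f / s)

# A kernel eating 40% of runtime, made 2x faster, yields only ~1.25x overall:
print(round(overall_speedup(0.4, 2.0), 2))  # 1.25
```

Even an infinite speedup on that 40% kernel tops out at 1/(1-0.4) ≈ 1.67x, which is why an orchestrator that picks targets by runtime share beats tuning kernels at random.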
But the comments? Spicy. One top voice rolled in with the classic “…and so it begins” and a reminder that Google allegedly shipped something similar (“AlphaEvolve”) ages ago—aka, “Google did it first.” Then the benchmark police arrived: “Compare it to TVM’s Ansor” (another auto-tuner) and, even harsher, one user claimed NVIDIA’s CUTLASS library is 3x faster on big matrix math, throwing shade on the project’s graphs and speed claims. Another commenter praised the tight scope—“keep it Triton”—but was confused by the progress chart and what exactly was being measured.
Meanwhile, the practical crowd had a rallying cry: “Point this at llama.cpp.” That open-source project runs on tons of home hardware and quirky formats—perfect for real-world gains. Between “robot devs will replace kernel gurus,” “show me the benchmarks,” and “make it useful for everyone,” the vibe is peak internet: hype vs. receipts, with a side of Skynet jokes about code that optimizes itself while you sleep.
Key Points
- AutoKernel autonomously optimizes GPU bottleneck kernels in PyTorch models by extracting and iteratively refining Triton kernels.
- The system uses a fixed benchmark (bench.py) with five correctness checks and roofline analysis to decide whether to keep or revert each change.
- An orchestrator applies Amdahl's law to prioritize which kernel to optimize next and tracks overall speedup.
- Setup requires an NVIDIA GPU (H100, A100, and RTX 4090 tested), Python 3.10+, and uv; example models include GPT-2, LLaMA, and BERT.
- Nine core kernel types are supported, each with a PyTorch reference and a starter Triton kernel; each experiment runs in ~90 seconds, enabling ~320 overnight.