Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster

Internet split: miracle lab intern or chimp with a power drill

TLDR: An AI agent got 16 GPUs, ran ~910 experiments, squeezed out a 2.87% improvement, and even learned to split work between cheaper and faster chips on its own. Commenters are split between “it’s just fancy tuning” and “emergent strategy is real,” with memes about chimps wielding power drills showing why this matters for future AI automation.

Give an AI a GPU cluster and watch the internet fight. A coding agent got 16 graphics chips, blasted through ~910 mini trials in eight hours, and shaved its validation score (val_bpb, where lower is better) down by 2.87%. It even taught itself to try cheap chips first and send promising ideas to the fancy ones, like a thrift shopper with VIP access. Fans cheered the speed and the smarts; skeptics rolled their eyes at the hype.
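The "cheap chips first, fancy chips for finalists" move boils down to a screen-then-promote pattern. A minimal sketch, with hypothetical names throughout; the toy scoring functions stand in for a short, noisy run on cheap hardware and a longer, trusted run on fast hardware:

```python
# Hypothetical sketch of screen-then-promote: score every candidate
# cheaply, then re-run only the best few on the faster (pricier) tier.
def screen_then_promote(candidates, cheap_eval, fast_eval, keep=3):
    """Rank all candidates by the cheap proxy, then verify the top `keep`."""
    screened = sorted(candidates, key=cheap_eval)   # lower score is better
    finalists = screened[:keep]
    return min(finalists, key=fast_eval)            # best verified candidate

# Toy stand-ins: "loss" as a function of a single hyperparameter value.
cheap = lambda x: (x - 3) ** 2 + 0.5   # short proxy evaluation
fast = lambda x: (x - 3) ** 2          # longer, trusted evaluation

best = screen_then_promote(range(10), cheap, fast, keep=3)
print(best)
```

The economics are the point: the cheap tier filters out most candidates, so the expensive tier only pays for the handful worth verifying.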

The hottest take: this is just hyperparameter tuning with extra steps. One commenter flatly called it “reinventing” an old trick and wondered whether classic methods like Bayesian optimization still rule. On the other side, jaws dropped at the agent’s sneaky move of using slower GPUs to screen and faster ones to verify, a strategy nobody explicitly taught it. Cue the meme war: “chimpanzee with a power drill” versus “genius lab intern pulling all-nighters.”

There was also side-eye over the “cluster” sizing (“two nodes? yawn”), a shoutout to SkyPilot for making cloud juggling painless, and a dreamer plotting the sequel: feed the agent the entire AI research library and let a swarm of agents write in a shared notebook. Verdict? The results are real, the drama is louder, and the peanut gallery is thriving.
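For the SkyPilot shoutout, the workflow amounts to a small task file per experiment. A minimal sketch, assuming a hypothetical `train.py` (the `resources` and `run` fields are standard SkyPilot task syntax; the accelerator choice and the script contents are illustrative):

```yaml
# Hypothetical SkyPilot task sketch; launch with: sky launch -c exp task.yaml
resources:
  accelerators: H100:1   # swap to H200:1 for the validation tier

run: |
  # fixed 5-minute training budget; the agent parses val_bpb from stdout
  python train.py --budget-seconds 300
```

Because the metric comes back on stdout, the agent only needs to launch tasks and read logs, no bespoke orchestration layer required.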

Key Points

  • With 16 GPUs on Kubernetes, the agent ran ~910 experiments in 8 hours and reduced val_bpb from 1.003 to 0.974 (2.87% improvement).
  • Parallel execution enabled factorial grids of 10–13 experiments per wave, revealing parameter interactions; model width scaling had the largest impact.
  • The agent exploited heterogeneous hardware by screening on H100s and validating on H200s without explicit instruction.
  • Autoresearch’s loop repeatedly edits train.py, trains under a fixed 5-minute budget, and keeps changes that lower val_bpb; the default single-GPU setup yields ~12 experiments/hour.
  • SkyPilot enabled the agent to autonomously launch/manage GPU clusters via YAML-defined experiments that output metrics to stdout.
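The propose-train-keep loop in the Key Points can be sketched as below. Everything here is a toy stand-in: `run_training`, the single "width" knob, and the numbers are illustrative, not the actual Autoresearch harness.

```python
# Hypothetical sketch of the outer loop: propose a change, train under a
# fixed budget, keep the change only if val_bpb improves (lower is better).
import random

def run_training(config, budget_seconds=300):
    """Toy stand-in for a budgeted train.py run; returns a fake val_bpb."""
    width_bonus = 0.001 * config["width"]   # wider model helps (toy model)
    noise = random.uniform(0.0, 0.002)      # run-to-run variance
    return 1.003 - width_bonus + noise

def autoresearch_loop(steps=20):
    best_cfg = {"width": 1}
    best_bpb = run_training(best_cfg)
    for _ in range(steps):
        # Propose a small random tweak to the current best config.
        candidate = {"width": max(1, best_cfg["width"] + random.choice([-1, 0, 1]))}
        bpb = run_training(candidate)
        if bpb < best_bpb:                  # keep only improvements
            best_cfg, best_bpb = candidate, bpb
    return best_cfg, best_bpb

random.seed(0)
cfg, bpb = autoresearch_loop()
print(cfg, round(bpb, 3))
```

Parallelism changes only the throughput: with one GPU this is ~12 experiments/hour; with 16 GPUs the same loop can fan each wave of candidates out as a grid, which is how ~910 runs fit into eight hours.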

Hottest takes

“reinventing hyper-parameter tuning.” — kraddypatties
“This feels like the chimpanzee with a power drill.” — covi
“noticed H200s scored better and started screening ideas on H100s, then promoting winners to H200s for validation” — zhwu
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.