I tried Karpathy's Autoresearch on an old research project

He did laundry, the bot did science — and the comments are on fire

TLDR: A researcher let an AI agent auto-tune an old project; it cut the key metric (mean retrieval rank, where lower is better) by 54% while fixing hidden bugs along the way. Comments split between hype and skepticism: fans call it proof that “AI interns” work, critics say it’s just fast tweaking, and many worry about cost and whether it generalizes beyond chatbots.

While a researcher folded laundry, an AI assistant ran rapid-fire experiments on an old project and cut its mean-rank score by 54% (lower is better), repeatedly tweaking a single training file. The setup was locked down in a sandbox, the runs were short (about five minutes each), and the bot even uncovered bugs the human didn’t know existed. Cue the comment section turning into a street fight.

One camp is cheering. “So… it did work,” says one user, reveling in the bug-fixing, score-boosting montage. Another crowd is side-eyeing the hype: “More engineering than research,” a top comment sniffs, saying the whole thing feels like super-charged Kaggle—aka competitive data tinkering—not groundbreaking science. The budget-watchers pile on too: one veteran warns that 90% of AI suggestions are junk and actually trying them all could torch your wallet, while still admitting the 10% gold can be brilliant. Practical folks ask the real question: can this work beyond chatbots—like upgrading a medical image model or a basic U-Net? Meanwhile, someone dropped an archive link because of course the original post was getting hammered.

The vibe: AI intern does the grunt work, humans argue about credit. Is this the future of research—or just turbocharged parameter futzing with great marketing?

Key Points

  • The author applied Karpathy’s Autoresearch to an old eCLIP-based project using Claude Code as an LLM-driven iterative optimizer.
  • The loop edited train.py based on program.md and scratchpad.md, running ~5-minute experiments with commit/revert decisions.
  • Sandboxing included containerization, restricted permissions, and initially no network access; later, web access was allowed in a final exploratory phase.
  • A new dataset (Ukiyo-eVG) with phrase-to-bounding-box annotations was used; boxes were converted to Gaussian heatmaps as an additional model input.
  • Across 42 experiments on an RTX 4090 in one day, mean rank on a 1K-image test set improved from 344.68 to 157.43 (~54% reduction), with 13 commits and 29 reverts.
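The commit/revert loop described in the Key Points is, at its core, a greedy optimizer: propose an edit, run a short experiment, keep the edit only if the metric improves, otherwise roll back. This is a toy sketch of that pattern, not the author's actual code; the function names, the scalar "state," and the metric are all illustrative stand-ins for editing train.py and re-running training.

```python
import random

def autoresearch_loop(propose, evaluate, apply, revert, n_iters=42):
    """Greedy commit/revert loop: try an edit, re-measure, and keep the
    edit only if the metric (lower is better) improved; else roll back."""
    best = evaluate()
    commits, reverts = 0, 0
    for _ in range(n_iters):
        edit = propose()          # stand-in for an LLM-proposed code change
        apply(edit)               # stand-in for editing train.py
        score = evaluate()        # stand-in for a ~5-minute training run
        if score < best:
            best = score
            commits += 1          # keep the change ("commit")
        else:
            revert(edit)          # undo the change ("revert")
            reverts += 1
    return best, commits, reverts

# Toy usage: "optimize" a single number toward zero.
random.seed(0)
state = {"x": 10.0}
def propose():  return random.uniform(-1.0, 1.0)
def apply(d):   state["x"] += d
def revert(d):  state["x"] -= d
def evaluate(): return abs(state["x"])  # pretend this is mean rank

best, commits, reverts = autoresearch_loop(propose, evaluate, apply, revert)
```

Note that with 13 commits against 29 reverts in the real run, most proposals were rejected, which matches the greedy structure: the loop only has to be right occasionally to make monotonic progress.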
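The box-to-heatmap conversion mentioned for the Ukiyo-eVG annotations is a common way to feed localization hints to a model. A minimal sketch, assuming axis-aligned (x0, y0, x1, y1) boxes and a Gaussian whose spread scales with box size (the `sigma_scale` parameter is an assumption, not a detail from the post):

```python
import numpy as np

def box_to_heatmap(box, h, w, sigma_scale=0.5):
    """Render one (x0, y0, x1, y1) bounding box as an h-by-w 2D Gaussian
    centred on the box, with spread proportional to box width/height."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1.0)   # avoid zero-width boxes
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]              # pixel coordinate grids
    return np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)

# One phrase-grounding box rendered as an extra single-channel model input.
heat = box_to_heatmap((10, 20, 50, 60), h=128, w=128)
```

Heatmaps for multiple phrases can then be stacked (or max-pooled) into an additional input channel alongside the image.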

Hottest takes

“So… It did work.” — love2read
“About 90% … is useless … the other 10% is nice.” — datsci_est_2015
“More engineering than research IMO” — lamroger
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.