I tried Karpathy's Autoresearch on an old research project

He did laundry, the bot did science — and the comments are on fire

TLDR: A researcher let an AI agent auto-tune an old project; it cut the key metric (mean retrieval rank, where lower is better) by 54% while fixing hidden bugs along the way. Comments split between hype and skepticism: fans call it proof that “AI interns” work, critics say it’s just fast tweaking, and many worry about cost and whether it generalizes beyond chatbots.

While a researcher folded laundry, an AI assistant ran rapid-fire experiments on an old project and cut its mean-rank score by 54% (lower is better), repeatedly tweaking a single training file. The setup was locked down in a sandbox, the runs were short (about five minutes each), and the bot even uncovered bugs the human didn’t know existed. Cue the comment section turning into a street fight.

One camp is cheering. “So… it did work,” says one user, reveling in the bug-fixing, score-boosting montage. Another crowd is side-eyeing the hype: “More engineering than research,” a top comment sniffs, saying the whole thing feels like super-charged Kaggle—aka competitive data tinkering—not groundbreaking science. The budget-watchers pile on too: one veteran warns that 90% of AI suggestions are junk and actually trying them all could torch your wallet, while still admitting the 10% gold can be brilliant. Practical folks ask the real question: can this work beyond chatbots—like upgrading a medical image model or a basic U-Net? Meanwhile, someone dropped an archive link because of course the original post was getting hammered.

The vibe: AI intern does the grunt work, humans argue about credit. Is this the future of research—or just turbocharged parameter futzing with great marketing?

Key Points

  • The author applied Karpathy’s Autoresearch to an old eCLIP-based project using Claude Code as an LLM-driven iterative optimizer.
  • The loop edited train.py based on program.md and scratchpad.md, running ~5-minute experiments with commit/revert decisions.
  • Sandboxing included containerization, restricted permissions, and initially no network access; later, web access was allowed in a final exploratory phase.
  • A new dataset (Ukiyo-eVG) with phrase-to-bounding-box annotations was used; boxes were converted to Gaussian heatmaps as an additional model input.
  • Across 42 experiments on an RTX 4090 in one day, mean rank on a 1K-image test set improved from 344.68 to 157.43 (~54% reduction), with 13 commits and 29 reverts.
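The commit/revert loop described in the Key Points is, at its core, a greedy optimizer: propose an edit, run a short experiment, keep the edit only if the metric improves, otherwise roll back. This is a toy sketch of that pattern, not the author's actual code; the function names, the scalar "state," and the metric are all illustrative stand-ins for editing train.py and re-running training.

```python
import random

def autoresearch_loop(propose, evaluate, apply, revert, n_iters=42):
    """Greedy commit/revert loop: try an edit, re-measure, and keep the
    edit only if the metric (lower is better) improved; else roll back."""
    best = evaluate()
    commits, reverts = 0, 0
    for _ in range(n_iters):
        edit = propose()          # stand-in for an LLM-proposed code change
        apply(edit)               # stand-in for editing train.py
        score = evaluate()        # stand-in for a ~5-minute training run
        if score < best:
            best = score
            commits += 1          # keep the change ("commit")
        else:
            revert(edit)          # undo the change ("revert")
            reverts += 1
    return best, commits, reverts

# Toy usage: "optimize" a single number toward zero.
random.seed(0)
state = {"x": 10.0}
def propose():  return random.uniform(-1.0, 1.0)
def apply(d):   state["x"] += d
def revert(d):  state["x"] -= d
def evaluate(): return abs(state["x"])  # pretend this is mean rank

best, commits, reverts = autoresearch_loop(propose, evaluate, apply, revert)
```

Note that with 13 commits against 29 reverts in the real run, most proposals were rejected, which matches the greedy structure: the loop only has to be right occasionally to make monotonic progress.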
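The box-to-heatmap conversion mentioned for the Ukiyo-eVG annotations is a common way to feed localization hints to a model. A minimal sketch, assuming axis-aligned (x0, y0, x1, y1) boxes and a Gaussian whose spread scales with box size (the `sigma_scale` parameter is an assumption, not a detail from the post):

```python
import numpy as np

def box_to_heatmap(box, h, w, sigma_scale=0.5):
    """Render one (x0, y0, x1, y1) bounding box as an h-by-w 2D Gaussian
    centred on the box, with spread proportional to box width/height."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1.0)   # avoid zero-width boxes
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]              # pixel coordinate grids
    return np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)

# One phrase-grounding box rendered as an extra single-channel model input.
heat = box_to_heatmap((10, 20, 50, 60), h=128, w=128)
```

Heatmaps for multiple phrases can then be stacked (or max-pooled) into an additional input channel alongside the image.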

Hottest takes

“So… It did work.” — love2read
“About 90% … is useless … the other 10% is nice.” — datsci_est_2015
“More engineering than research IMO” — lamroger
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.