March 23, 2026
Laundry day, AI slay
I tried Karpathy's Autoresearch on an old research project
He did laundry, the bot did science — and the comments are on fire
TLDR: A researcher let an AI agent auto-tune an old project, and it improved the evaluation metric by 54% while fixing hidden bugs. Comments are split between hype and skepticism: fans call it proof that “AI interns” work, critics say it’s just fast tweaking, and many worry about cost and whether it generalizes beyond chatbots.
While a researcher folded laundry, an AI assistant ran rapid-fire experiments on an old project and cut its evaluation metric (mean retrieval rank, where lower is better) by 54%, turning a so-so result into something much stronger, by repeatedly tweaking a single file. The setup was locked down in a sandbox, the runs were short (about five minutes each), and the bot even uncovered bugs the human didn’t know existed. Cue the comment section turning into a street fight.
One camp is cheering. “So… it did work,” says one user, reveling in the bug-fixing, score-boosting montage. Another crowd is side-eyeing the hype: “More engineering than research,” a top comment sniffs, saying the whole thing feels like super-charged Kaggle—aka competitive data tinkering—not groundbreaking science. The budget-watchers pile on too: one veteran warns that 90% of AI suggestions are junk and actually trying them all could torch your wallet, while still admitting the 10% gold can be brilliant. Practical folks ask the real question: can this work beyond chatbots—like upgrading a medical image model or a basic U-Net? Meanwhile, someone dropped an archive link because of course the original post was getting hammered.
The vibe: AI intern does the grunt work, humans argue about credit. Is this the future of research—or just turbocharged parameter futzing with great marketing?
Key Points
- The author applied Karpathy’s Autoresearch to an old eCLIP-based project using Claude Code as an LLM-driven iterative optimizer.
- The loop edited train.py based on program.md and scratchpad.md, running ~5-minute experiments with commit/revert decisions.
- Sandboxing included containerization, restricted permissions, and initially no network access; later, web access was allowed in a final exploratory phase.
- A new dataset (Ukiyo-eVG) with phrase-to-bounding-box annotations was used; boxes were converted to Gaussian heatmaps as an additional model input.
- Across 42 experiments on an RTX 4090 in one day, mean rank on a 1K-image test set improved from 344.68 to 157.43 (~54% reduction), with 13 commits and 29 reverts.
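The commit/revert loop in the points above can be sketched in a few lines. This is a minimal, hypothetical reconstruction, not the actual Autoresearch code: the `apply_edit`, `run_experiment`, and `revert_edit` callables stand in for the LLM editing train.py, a ~5-minute training run, and a git revert, and the acceptance rule (keep only if mean rank drops) is an assumption consistent with the reported 13 commits and 29 reverts.

```python
def optimize(apply_edit, run_experiment, revert_edit, n_iters, start_score):
    """Sketch of an Autoresearch-style loop (names hypothetical).

    Each iteration proposes an edit, runs a short experiment, and keeps
    the edit only if the score (mean rank, lower is better) improves.
    Returns the best score plus commit/revert counts.
    """
    best = start_score
    commits, reverts = 0, 0
    for _ in range(n_iters):
        apply_edit()                # e.g. LLM rewrites train.py
        score = run_experiment()    # e.g. a ~5-minute training run
        if score < best:            # lower mean rank is better
            best = score
            commits += 1            # keep the change (e.g. git commit)
        else:
            revert_edit()           # discard it (e.g. git checkout -- train.py)
            reverts += 1
    return best, commits, reverts
```

With a high revert rate like the one reported (29 of 42 runs), the value of the loop comes almost entirely from cheap experiments and a reliable revert path, which is why the short runs and sandboxing matter.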
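The box-to-heatmap conversion mentioned for the Ukiyo-eVG annotations might look like the following. This is a generic sketch, not the project's actual code: the `sigma_scale` parameter and the choice of a peak value of 1.0 at the box center are assumptions, and real pipelines often sum or max-pool heatmaps over multiple boxes per phrase.

```python
import numpy as np

def box_to_heatmap(box, img_h, img_w, sigma_scale=0.5):
    """Convert one (x1, y1, x2, y2) bounding box to a Gaussian heatmap.

    The Gaussian is centered on the box center; its spread is tied to the
    box size via sigma_scale (a free parameter in this sketch). The result
    is an (img_h, img_w) float32 array peaking at 1.0 inside the box,
    usable as an extra input channel to the model.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = max((x2 - x1) * sigma_scale, 1.0)  # avoid zero-width Gaussians
    sy = max((y2 - y1) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:img_h, 0:img_w]
    heatmap = np.exp(-((xs - cx) ** 2 / (2 * sx**2) + (ys - cy) ** 2 / (2 * sy**2)))
    return heatmap.astype(np.float32)
```

A soft heatmap like this gives the model a dense spatial prior for where a phrase is grounded, which is gentler than a hard binary mask at the box edges.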