March 7, 2026

Bots pull all-nighters, charts pull punches

Autoresearch: Agents researching on single-GPU nanochat training automatically

AI intern runs 5‑minute night experiments; fans cheer, skeptics cry “just knob‑twiddling”

TL;DR: A tiny repo lets an AI agent auto‑tune a small chatbot with 5‑minute runs on one GPU; a fix now helps non‑H100 cards. Commenters are split between “cool robot intern” and “just hyperparameter fiddling,” with extra drama over chart scales and whether this beats traditional tuning methods.

Call it the robot intern era: a tiny AI agent tweaks a small chatbot model, runs 5‑minute sprints on one graphics card, keeps what helps, dumps what doesn’t, and leaves you a morning report on GitHub. It’s simple, self‑contained, and even ships under the MIT License. But the comments? Absolute theater.

One camp is cackling at the absurd future: as falcor84 joked, we’re a patch away from agents publishing and peer‑reviewing themselves. Others just want receipts: abeppu grilled the premise, asking whether, if the “improvements” are mostly knob‑twists like learning rates, this is any better than classic tuning, and whether the AI is just making random changes. Meanwhile, kubb rolled in with the roast of the day: why burn expensive Claude tokens to squeeze tiny boosts from a tiny model?

Then came the chart drama: lostmsu called out a non‑zero y‑axis that “makes it look very successful,” prompting a chorus of “show the full scale.” Hardware folks had their own subplot: Flash Attention 3 crashing on non‑Hopper GPUs got a fix merged, and the tool now plays nicer beyond the NVIDIA H100. Another issue flagged that the performance math is hardcoded to H100 numbers. TL;DR: half the crowd sees a fun, democratized lab assistant; the other half sees overhype, cherry‑picked charts, and a weekend toy with vibes of “autotune, but for models.”
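The hardcoded-performance-math complaint is easy to see in a sketch. Model FLOPs utilization (MFU) divides achieved throughput by the card’s peak throughput, so baking in an H100 peak misreports the number on any other GPU. This is a hypothetical illustration, not the repo’s actual code; the default peak below is the H100’s dense bf16 figure, and the function name is an assumption:

```python
def mfu(flops_per_step: float, step_time_s: float,
        peak_flops: float = 989e12) -> float:
    """Model FLOPs utilization: achieved FLOP/s over hardware peak.

    The default peak here is the H100's dense bf16 throughput
    (~989 TFLOP/s). Hardcoding it means that on a slower card,
    say an A100 (~312e12), the reported MFU is misleadingly low,
    because the denominator no longer matches the hardware.
    """
    achieved = flops_per_step / step_time_s  # FLOP/s actually sustained
    return achieved / peak_flops
```

Passing the right `peak_flops` for the GPU in use is the whole fix; the formula itself is the same on every card.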

Key Points

  • Autonomous agent iteratively edits a single training file (train.py) and runs 5‑minute experiments, evaluating validation bits per byte (val_bpb).
  • Project uses a minimal single‑GPU nanochat implementation with a GPT model and Muon + AdamW optimizers; only PyTorch and a few small packages are required.
  • Quick start requires a single NVIDIA GPU (tested on H100), Python 3.10+, and uv; data prep trains a BPE tokenizer before running experiments.
  • Design choices include a fixed time budget, a single file for edits, a vocab-size‑independent BPB metric, and a self-contained setup without distributed training.
  • A PR adding a fallback Flash Attention 3 kernel for non‑Hopper GPUs was merged; the related crash issue was closed, while MFU calculation issues remain open.
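The loop in the first bullet, scored with the vocab‑size‑independent metric from the fourth, can be sketched in a few lines. Everything here is a hypothetical illustration under stated assumptions, not the repo’s actual API: we assume `train.py` accepts a time‑budget flag and prints a final `val_bpb=` line, and the accept/revert bookkeeping uses a backup file name invented for this sketch.

```python
import math
import shutil
import subprocess

def val_bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Bits per byte: cross-entropy in nats converted to bits, then
    normalized by raw UTF-8 byte count. Because the denominator is
    bytes rather than tokens, the metric stays comparable across
    tokenizers with different vocab sizes."""
    return (total_nll_nats / math.log(2)) / total_bytes

def run_experiment(budget_s: int = 300) -> float:
    """Hypothetical: run train.py under a fixed wall-clock budget
    and read back the validation bits-per-byte it reports.
    Assumes train.py prints a final line like 'val_bpb=1.2345'."""
    out = subprocess.run(
        ["python", "train.py", f"--time-budget={budget_s}"],
        capture_output=True, text=True, timeout=budget_s + 60,
    )
    return float(out.stdout.strip().rsplit("val_bpb=", 1)[1])

def keep_if_better(best_bpb: float) -> float:
    """Keep the agent's latest edit to train.py only if val_bpb
    improved (lower is better); otherwise restore the last good copy."""
    score = run_experiment()
    if score < best_bpb:
        shutil.copy("train.py", "train.py.best")   # accept the edit
        return score
    shutil.copy("train.py.best", "train.py")       # revert the edit
    return best_bpb
```

The fixed 5‑minute budget is what makes the loop tractable overnight: every candidate edit costs the same wall‑clock time, so the agent’s morning report is a straightforward list of which edits lowered `val_bpb` and which got reverted.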

Hottest takes

"burning Claude tokens to slightly improve his tiny and not very capable LLM?" — kubb
"Non-zero based chart makes it look like it was very successful." — lostmsu
"better or worse ... than hyperparameter tuning techniques that don't involve an LLM?" — abeppu
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.