March 28, 2026
Tape the hype
Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer
Internet loses it over AI on a ’70s machine—nostalgia, shade, and trackpoint worship
TL;DR: A programmer trained a simple AI on a 1970s PDP-11 to reverse digits, proving clever tuning can beat brute-force hardware. Commenters debated whether modern AI needs massive compute, joked about trackpoints, and dropped "Turing Machine" one-liners—turning retro nostalgia into a sharp reality check on today's AI hype.
A programmer just trained a tiny Transformer (an AI that learns patterns) on a 1976 PDP-11 minicomputer to reverse digits—and the comment section lit up like a retro arcade. One camp is screaming, "So do we really need all that modern mega-compute?" with one poster asking "exactly how true?" while another snarked that the real flex is simply having a PDP-11 running in 2026. The project uses hand-tuned settings instead of fancy modern optimizers and fits in the kind of memory your smartwatch would laugh at. Cue debates about whether this is a genuine insight or just a clever party trick.
The nostalgia wave was strong: people fawned over the author’s 20-year-old “modern” laptop with a beloved concave trackpoint, turning the thread into a surprise keyboard cult rally. Meanwhile, a zinger stole the show: “So, really, a Turing Machine is all you need?” which had the crowd riffing on minimalism and the old-school roots of computing. The author jumped in, all chill and helpful, offering to field questions about fixed-point math and hardware quirks, winning points for transparency.
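For the curious, the fixed-point math the author offered to discuss can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the 16-bit Q4.12 format (12 fractional bits) is an assumption chosen for illustration, and the helper names (`to_fixed`, `fx_mul`) are invented here.

```python
# Minimal fixed-point arithmetic sketch, in the spirit of a 16-bit
# machine with no floating-point unit. Q4.12 (12 fractional bits)
# is an illustrative assumption, not necessarily what ATTN/11 uses.
FRAC_BITS = 12
ONE = 1 << FRAC_BITS  # 1.0 in fixed point (4096)

def to_fixed(x: float) -> int:
    """Convert a float to a Q4.12 fixed-point integer."""
    return int(round(x * ONE))

def to_float(q: int) -> float:
    """Convert a Q4.12 fixed-point integer back to a float."""
    return q / ONE

def fx_mul(a: int, b: int) -> int:
    # Multiply into a wider intermediate, then shift back down;
    # on a PDP-11 this would be the 32-bit MUL result shifted right.
    return (a * b) >> FRAC_BITS

product = fx_mul(to_fixed(0.5), to_fixed(0.25))
print(to_float(product))  # 0.125
```

The key trick is the widening multiply followed by a right shift, which keeps everything in integer registers while preserving fractional precision.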
Is this going to replace your phone’s AI? No. But as a time-traveling thought experiment, it’s pure gold—and yes, Tensor2Tensor fans chimed in to note the reversal task is a legit benchmark. Retro meets reality, and the comments did not disappoint.
Key Points
- ATTN/11 implements a single-layer, single-head Transformer in PDP-11 assembly that learns to reverse 8-digit sequences.
- The encoder-only architecture uses embeddings, self-attention with a residual connection, an output projection, and softmax; there is no layer norm, feed-forward block, or decoder.
- Model size: d_model=16, vocab=10 (the digits 0–9), sequence length 8, for 1,216 parameters in total.
- Initial Fortran IV training with a uniform learning rate took ~1,500 steps (≈6.5 hours on the hardware); per-layer learning rates cut this to ~600 steps (~2.5 hours) with plain SGD.
- Bare-metal assembly and fixed-point arithmetic (NN11) let it fit in 32KB rather than the full 64KB address space, in a compact 6,179-byte binary; Adam was avoided due to its memory and compute costs.
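The architecture in the key points is small enough to sketch in full. The NumPy sketch below is a reconstruction from the description above, not the author's code; in particular, the learned positional embeddings (8×16) are an assumption, but they are what makes the parameter count come out to exactly the stated 1,216 (160 token embeddings + 128 positional + 3×256 attention + 160 output projection).

```python
import numpy as np

D, V, S = 16, 10, 8  # d_model, vocab (digits), sequence length
rng = np.random.default_rng(0)

# Parameter shapes per the key points: embeddings, single-head
# self-attention (Q, K, V), and an output projection to the vocab.
# Learned positional embeddings (S x D) are an assumption here.
params = {
    "tok_emb": rng.normal(0, 0.1, (V, D)),   # 10 x 16 = 160
    "pos_emb": rng.normal(0, 0.1, (S, D)),   #  8 x 16 = 128
    "W_q":     rng.normal(0, 0.1, (D, D)),   # 16 x 16 = 256
    "W_k":     rng.normal(0, 0.1, (D, D)),   # 16 x 16 = 256
    "W_v":     rng.normal(0, 0.1, (D, D)),   # 16 x 16 = 256
    "W_out":   rng.normal(0, 0.1, (D, V)),   # 16 x 10 = 160
}
total = sum(p.size for p in params.values())
print(total)  # 1216

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward(digits):
    """Embeddings -> self-attention with residual -> projection -> softmax.
    No layer norm, no FFN, no decoder, matching the description."""
    x = params["tok_emb"][digits] + params["pos_emb"]           # (S, D)
    q, k, v = x @ params["W_q"], x @ params["W_k"], x @ params["W_v"]
    attn = softmax(q @ k.T / np.sqrt(D))                        # (S, S)
    x = x + attn @ v                                            # residual
    return softmax(x @ params["W_out"])                         # (S, V)

probs = forward(np.array([3, 1, 4, 1, 5, 9, 2, 6]))
print(probs.shape)  # (8, 10): per-position distribution over digits
```

At 1,216 weights, even stored as 16-bit fixed-point values the model itself is under 2.5KB, which is why the whole binary can stay so small.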