Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

Internet loses it over AI on a ’70s machine—nostalgia, shade, and trackpoint worship

TLDR: A programmer trained a simple AI on a 1970s PDP-11 to reverse digits, showing that careful hand-tuning can stand in for brute-force hardware on toy tasks. Commenters debated whether modern AI needs massive compute, joked about trackpoints, and dropped “Turing Machine” one‑liners—turning retro nostalgia into a sharp reality check on today’s AI hype.

A programmer just trained a tiny Transformer (an AI that learns patterns) on a 1976 PDP-11 minicomputer to reverse digits—and the comment section lit up like a retro arcade. One camp is screaming, “So do we really need all that modern mega-compute?” with one poster asking, “exactly how true?” while another snarked that the real flex is simply having a PDP-11 running in 2026. The project uses hand-tuned settings instead of fancy modern optimizers and fits in the kind of memory your smartwatch would laugh at. Cue debates about whether this is genius perspective or just a clever party trick.

The nostalgia wave was strong: people fawned over the author’s 20-year-old “modern” laptop with a beloved concave trackpoint, turning the thread into a surprise keyboard cult rally. Meanwhile, a zinger stole the show: “So, really, a Turing Machine is all you need?” which had the crowd riffing on minimalism and the old-school roots of computing. The author jumped in, all chill and helpful, offering to field questions about fixed-point math and hardware quirks, winning points for transparency.

Is this going to replace your phone’s AI? No. But as a time-traveling thought experiment, it’s pure gold—and yes, Tensor2Tensor fans chimed in to note the reversal task is a legit benchmark. Retro meets reality, and the comments did not disappoint.

Key Points

  • ATTN/11 implements a single-layer, single-head transformer in PDP-11 assembly to reverse 8-digit sequences.
  • The encoder-only architecture uses embeddings, self-attention with residual, output projection, and softmax; no layer norm, FFN, or decoder.
  • Model size: d_model=16, vocab=10 (digits), sequence length=8, totaling 1,216 parameters.
  • Initial Fortran IV training with a uniform learning rate took ~1,500 steps (≈6.5 hours on the hardware); switching to per-layer learning rates cut this to ~600 steps (~2.5 hours) with plain SGD.
  • Bare-metal assembly and the fixed-point NN11 library fit everything in 32KB rather than the machine's full 64KB address space, yielding a compact 6,179-byte binary; Adam was avoided because of its memory and compute costs.
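The architecture in the bullets above is small enough to sanity-check in a few lines. The following is a minimal NumPy sketch, not the author's PDP-11 implementation: it assumes learned positional embeddings and random weights purely for illustration, but the layer shapes (token embeddings, per-position embeddings, Q/K/V projections, residual add, output projection) reproduce the stated 1,216-parameter count exactly.

```python
import numpy as np

# Hypothetical sketch of the described ATTN/11 shape: single-layer,
# single-head, encoder-only; no layer norm, no FFN, no decoder.
D, V, S = 16, 10, 8  # d_model, vocab (digits 0-9), sequence length

rng = np.random.default_rng(0)
tok_emb = rng.normal(0, 0.1, (V, D))  # 10*16 = 160 params
pos_emb = rng.normal(0, 0.1, (S, D))  #  8*16 = 128 params
Wq = rng.normal(0, 0.1, (D, D))       # 16*16 = 256 params
Wk = rng.normal(0, 0.1, (D, D))       # 256 params
Wv = rng.normal(0, 0.1, (D, D))       # 256 params
Wo = rng.normal(0, 0.1, (D, V))       # 16*10 = 160 params

def forward(digits):
    """One forward pass over an array of 8 digit tokens."""
    x = tok_emb[digits] + pos_emb                     # (S, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)                     # (S, S)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
    x = x + attn @ v                                  # residual connection
    return x @ Wo                                     # logits, (S, V)

total = sum(w.size for w in (tok_emb, pos_emb, Wq, Wk, Wv, Wo))
print(total)  # 1216 — matches the stated parameter count
```

Note how the count only works out if positions get their own embedding table; with sinusoidal (non-learned) positions the total would be 1,088, so the 1,216 figure implies learned positional parameters.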

Hottest takes

I find that more impressive than the program — AnimalMuppet
With a concave trackpoint, respect — kristopolous
So, really, a Turing Machine is all you need? — kmoser
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.