A CPU that runs entirely on GPU

Your graphics card just learned to think — and the comments are losing it

TLDR: A mini “computer” now runs entirely on a graphics card using neural networks for every math step. The crowd is split: some praise the 100% accurate results and the wild twist that multiply beats add, while others demand speed comparisons and wonder what this means for regular CPUs.

A developer just built a “computer brain” that lives entirely on your graphics card, using little AI models for every math move — add, multiply, even sine and cosine. No normal calculator steps, just trained networks and tensors. The kicker? Multiplication runs 12x faster than addition, flipping what everyone knows about computers and sending the comments into a comedy‑panic.
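Why would multiply ever beat add? The write-up credits a byte-pair lookup table: a product becomes one parallel table read, while addition drags a carry from bit to bit. Here is a minimal, hypothetical plain-Python sketch of that contrast — the project reportedly uses trained PyTorch models, so the precomputed table below merely stands in for a learned LUT, and `ripple_add` is a generic carry chain, not the project's code:

```python
# Hypothetical sketch: table-lookup multiply vs. a sequential carry chain.
# A plain precomputed table stands in for the project's learned LUT model.

# Precompute every 8-bit x 8-bit product once: one flat 65,536-entry table.
MUL_LUT = [(a * b) & 0xFFFF for a in range(256) for b in range(256)]

def lut_mul(a: int, b: int) -> int:
    """Multiply two bytes with a single indexed lookup -- no sequential work."""
    return MUL_LUT[(a << 8) | b]

def ripple_add(a: int, b: int, width: int = 8) -> int:
    """Add two bytes bit by bit; each step waits on the previous carry."""
    result, carry = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (x & carry) | (y & carry)  # sequential dependency
    return result
```

The lookup has no data-dependent chain at all, which is exactly the kind of work a GPU batches well; the ripple adder's carry is inherently serial, which is the bottleneck the Kogge-Stone trick mentioned below attacks.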

Fans like RagnarD are thrilled this proves neural models can do precise math, while skeptics immediately ask the practical question: is this way slower than a real CPU? One commenter, sudo_cowsay, gawks at the upside‑down speed chart and jokingly wonders if the regular CPU should start packing its desk. A semantics brawl breaks out when Surac declares, “GPUs are just special‑purpose CPUs,” prompting replies about whether this is a science project or the future of hardware. Then bmc7505 drops a prophecy bomb: “As foretold six years ago,” linking to a vision from breandan.net.

Memes fly: “GPU got promoted to CFO (Chief Fastest Operator),” “CPU applying for unemployment,” and “Carry‑lookahead is a vibe.” Whether it’s genius or gimmick, the vibe is clear — people love the audacity, and they want receipts on speed and real‑world use before crowning it the new king of compute.

Key Points

  • A CPU implemented entirely on GPU uses PyTorch tensors for registers, memory, flags, and the program counter, with all ALU ops executed by trained neural networks.
  • Instruction implementations map to specific neural models (e.g., Kogge-Stone carry-lookahead for add/sub/cmp, a byte-pair lookup table for mul, attention-based shifts, and neural truth tables for bitwise ops).
  • Integer arithmetic reports 100% accuracy verified by 347 automated tests; 23 models (~135 MB) are trained, with 13 wired for execution, including math functions and an LLM-based decoder.
  • Benchmarks on Apple Silicon (PyTorch 2.10.0, MPS backend) show per-op latencies from ~21 µs (mul, exp/log, bitwise) to ~935 µs (atan2), with ~60 ms model load time and ~4,975 instructions per second.
  • Key findings: multiplication is ~12× faster than addition due to LUT parallelism; Kogge-Stone carry-lookahead reduces add/sub/cmp latency ~3.3× compared to neural ripple-carry, demonstrating classical hardware principles applied to neural models.
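The Kogge-Stone speedup in the last bullet is a classical parallel-prefix trick: all carries are resolved in log₂(width) combine stages instead of a width-long ripple, so an 8-bit add needs 3 stages rather than 8. A dependency-free Python sketch of the bitwise form (illustrative only — the project realizes these stages with neural models, not Python integers):

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> int:
    """Add via Kogge-Stone carry-lookahead: log2(width) prefix stages.

    g (generate) marks bit positions that produce a carry; p (propagate)
    marks positions that pass one along. Each stage doubles the distance
    over which (g, p) pairs are merged, so every carry is known after
    log2(width) steps instead of a width-long ripple.
    """
    mask = (1 << width) - 1
    g, p = a & b, a ^ b
    dist = 1
    while dist < width:
        g |= p & (g << dist)   # merge with the (g, p) pair `dist` bits below
        p &= p << dist
        dist <<= 1
    carries = (g << 1) & mask  # carry into bit i comes from the prefix at bit i-1
    return ((a ^ b) ^ carries) & mask
```

Every stage is a handful of bitwise ops over the whole word at once — the same shape of work (wide, uniform, no long serial chain) that makes the scheme a good fit for batched tensor models, which is the reported ~3.3× win over neural ripple-carry.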

Hottest takes

"precise math in an LLM is important" — RagnarD
"how much slower is this" — lorenzohess
"Multiplication is 12x faster than addition..." — sudo_cowsay
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.