TinyTinyTPU: 2×2 systolic-array TPU-style matrix-multiply unit deployed on FPGA

Baby TPU on a Lego-like chip sparks Bitcoin déjà vu

TLDR: A miniature, educational TPU runs on a cheap reprogrammable chip and actually does the full AI inference math, sparking big talk. Commenters split between “this is the next Bitcoin-style hardware rush to custom chips” and “but how do I program it,” with Nvidia’s margins fueling ASIC gold-rush predictions.

A tiny teaching chip that mimics Google’s Tensor Processing Unit just ran on a reprogrammable “Lego” chip called an FPGA, and the comments went nuclear. The project, TinyTinyTPU, squeezes a baby 2×2 math engine onto a hobby board and still does the whole AI inference dance: multiply, add, activate, normalize, and even quantize. Readers dubbed it the “Baby Yoda of TPUs” and the “Raspberry Pi of AI,” while the non-nerds asked, in unison: so… can I actually use this thing?

Cue the drama. One camp is yelling “hardware gold rush,” with aunty_helen pointing to Nvidia’s fat margins and predicting a wave of custom ASICs (single-purpose chips), just like the Bitcoin saga that raced from GPUs to FPGAs to ASICs. mrinterweb leaned hard into the crypto déjà vu, while others debated whether the average person will ever buy a “home inference box.” Meanwhile, hinkley spun a brainy hot take: use AI as a smart guesser to kick off expensive tasks early, like speculative execution for real life. On the ground, the practical crowd (fooblaster) just wanted to know how to program it, and power users (ph4evers) asked if it could run JAX code from Python. The consensus? It’s tiny, it’s real, and it’s lighting up the age-old fight: scrappy DIY chips vs. the coming ASIC overlords, and everyone wants in on the punchline.

Key Points

  • TinyTinyTPU implements a 2×2 systolic-array TPU-style matrix-multiply unit in SystemVerilog and runs on a Basys3 (Xilinx Artix‑7) FPGA (see the systolic-array sketch after this list).
  • It includes a full post-MAC pipeline (accumulator, activation, normalization, quantization) and a UART-based host interface for multi-layer MLP inference; a sketch of that pipeline follows below.
  • Resource usage on XC7A35T is ~1,000 LUTs, ~1,000 flip‑flops, 8 DSP48E1 slices, and ~10–15 BRAM blocks (≈25,000 gates).
  • The project provides open-source (Yosys/nextpnr) and vendor (Vivado) build flows, plus automated TCL scripts to generate basys3_top.bit in 5–10 minutes.
  • A comprehensive test suite using Verilator, Python, and cocotb covers modules and end-to-end integration, with waveform viewing via GTKWave/Surfer; a cocotb-style testbench sketch follows below.
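
To make the 2×2 systolic-array idea concrete, here is a minimal cycle-level sketch in Python of an output-stationary 2×2 MAC grid computing a matrix product. It is an illustration under assumptions: the dataflow (output-stationary, skewed operand feeds) and every name in it are ours, not the project’s actual RTL.

```python
import numpy as np

def systolic_matmul_2x2(A, B, cycles=6):
    """Cycle-level sketch of a 2x2 output-stationary systolic array.

    Rows of A enter from the left and columns of B enter from the top,
    each skewed by one cycle; every PE multiplies the values passing
    through it and accumulates into its own output register.
    """
    N = 2
    acc = np.zeros((N, N), dtype=np.int32)    # per-PE accumulators
    a_reg = np.zeros((N, N), dtype=np.int32)  # activations moving right
    b_reg = np.zeros((N, N), dtype=np.int32)  # weights moving down

    def a_in(i, t):  # skewed feed: row i sees A[i, t - i] at cycle t
        k = t - i
        return A[i, k] if 0 <= k < N else 0

    def b_in(j, t):  # skewed feed: column j sees B[t - j, j] at cycle t
        k = t - j
        return B[k, j] if 0 <= k < N else 0

    for t in range(cycles):
        new_a = np.zeros_like(a_reg)
        new_b = np.zeros_like(b_reg)
        for i in range(N):
            for j in range(N):
                a_val = a_in(i, t) if j == 0 else a_reg[i, j - 1]
                b_val = b_in(j, t) if i == 0 else b_reg[i - 1, j]
                acc[i, j] += a_val * b_val
                new_a[i, j] = a_val  # pass activation to the right
                new_b[i, j] = b_val  # pass weight down
        a_reg, b_reg = new_a, new_b
    return acc

A = np.array([[1, 2], [3, 4]], dtype=np.int32)
B = np.array([[5, 6], [7, 8]], dtype=np.int32)
assert (systolic_matmul_2x2(A, B) == A @ B).all()
```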
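
The post-MAC pipeline in the second bullet can be pictured the same way. The sketch below runs an int32 accumulator result through ReLU, a fixed-point rescale, and requantization back to int8; the scale factor, rounding mode, and zero-point are placeholder assumptions, not the project’s actual parameters.

```python
import numpy as np

def post_mac_pipeline(acc, scale=2**-7, zero_point=0):
    """Illustrative post-MAC stage: ReLU, rescale, requantize to int8.

    `acc` is the int32 systolic-array output; scale, rounding, and
    zero_point here are stand-ins, not TinyTinyTPU's exact parameters.
    """
    activated = np.maximum(acc, 0)                     # ReLU activation
    normalized = activated.astype(np.float64) * scale  # scale the accumulator down
    quantized = np.clip(np.rint(normalized) + zero_point, -128, 127)
    return quantized.astype(np.int8)                   # int8 feeds the next layer

print(post_mac_pipeline(np.array([[1900, -300], [4200, 96]], dtype=np.int32)))
```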
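
For the test suite, a cocotb testbench drives the design’s signals from Python and awaits clock edges. The sketch below shows that general shape only; the port names (clk, rst, a_in, b_in, result, done) are hypothetical and will not match the real TinyTinyTPU interface verbatim.

```python
# Minimal cocotb-style testbench sketch. All port names are hypothetical
# placeholders, not the actual TinyTinyTPU top-level interface.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def single_mac_smoke_test(dut):
    """Drive one multiply-accumulate and check the result register."""
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())

    dut.rst.value = 1            # apply reset for one cycle
    await RisingEdge(dut.clk)
    dut.rst.value = 0

    dut.a_in.value = 3           # activation operand
    dut.b_in.value = 4           # weight operand
    await RisingEdge(dut.clk)

    while dut.done.value == 0:   # wait for the pipeline to flush
        await RisingEdge(dut.clk)

    assert int(dut.result.value) == 12, "3 * 4 should accumulate to 12"
```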

Hottest takes

“Nvidia’s 70% margins are too hard to ignore” — aunty_helen
“This reminds me of bitcoin mining” — mrinterweb
“A cross between Bloom Filters and speculative execution” — hinkley
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.