How does Taalas "print" an LLM onto a chip?

Fans cheer “game cartridge AI” while skeptics shout “show us proof”

TLDR: Taalas claims a custom chip with an AI model “printed” into its silicon that’s far faster and cheaper than GPUs, igniting huge buzz. The crowd is split between giddy dreams of swappable AI cartridges and hard-nosed skeptics demanding proof about single-transistor math, wiring complexity, and whether this is hype or breakthrough.

Taalas says it literally “prints” an AI model onto a custom chip — and the internet promptly split into two camps: the wow crowd and the wait-a-sec crowd. The startup claims its fixed-function ASIC runs Llama 3.1 8B at a blistering 17,000 tokens per second and is 10x cheaper, faster, and greener than a GPU. The twist: the model’s weights are etched into silicon, data streams straight through, and there’s no off-chip memory traffic. A tiny on-chip SRAM handles the working memory and adapters. Sounds like breaking the “memory wall”? Cue drama.
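
To see why etching the weights matters, here’s a back-of-envelope sketch (Python) of the off-chip bandwidth a conventional accelerator would need to match the claimed throughput if it re-read every weight for every token. The batch-size-1, one-pass-per-token setup and the 4-bit weight width (borrowed from the multiply claim below) are illustrative assumptions, not Taalas figures.

```python
# Back-of-envelope: weight traffic a conventional accelerator would need
# to sustain the claimed throughput if every weight were fetched from
# off-chip memory once per generated token (batch size 1, no reuse).
# All figures below are illustrative assumptions, not vendor numbers.

PARAMS = 8e9             # Llama 3.1 8B parameter count
BITS_PER_WEIGHT = 4      # matches the claimed 4-bit multiply scheme
TOKENS_PER_SEC = 17_000  # Taalas's reported throughput

bytes_per_token = PARAMS * BITS_PER_WEIGHT / 8   # one full pass, ~4 GB
bandwidth = bytes_per_token * TOKENS_PER_SEC     # bytes per second

print(f"weights read per token:    {bytes_per_token / 1e9:.0f} GB")
print(f"required weight bandwidth: {bandwidth / 1e12:.0f} TB/s")  # ~68 TB/s
```

That ~68 TB/s is well beyond the few TB/s a modern HBM-equipped GPU delivers; real GPUs amortize weight reads across large batches, so this is the worst case, but it is exactly the wall the etched-weights design is supposed to sidestep.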

Engineers raised eyebrows at the “single-transistor multiply” claim. One commenter wondered if it’s analog math in disguise and warned it might be “too noisy,” while others asked whether wiring such a densely connected network is even practical. Meanwhile, the meme machine went wild: one fan pictured swapping models like Nintendo DS game carts, while another compared it to a built-in “AI core,” akin to hardware video codecs (H.264, AV1). There was even meta-snark that the post “doesn’t answer the question.” Supporters hailed a retro-futurist comeback for structured chips; skeptics warned about costs, upgrade lag, and bold marketing. Verdict from the comment-section gladiator pit: thrilling if true, but show the receipts, please.

Key Points

  • Taalas introduced a fixed‑function ASIC that hardwires Llama 3.1 8B’s weights, reporting ~17,000 tokens/sec.
  • The company claims ~10× lower total ownership cost, ~10× lower power, and ~10× faster inference than GPU-based systems.
  • The chip streams activations through 32 etched layers, avoiding external DRAM/HBM; on‑chip SRAM holds the KV cache and LoRA adapters (see the sizing sketch after this list).
  • A claimed single‑transistor 4‑bit multiply scheme performs the multiplication in place, next to the stored weight, minimizing data movement (see the numeric sketch after the list).
  • Taalas maps models by customizing only the top two masks of a base chip; Llama 3.1 8B mapping reportedly took two months.
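
For scale on that SRAM claim, here’s a rough KV-cache sizing sketch using Llama 3.1 8B’s published architecture (32 transformer layers, 8 grouped-query KV heads, head dimension 128). The 8-bit cache precision is an assumption; Taalas hasn’t published its actual context budget.

```python
# Rough KV-cache footprint for Llama 3.1 8B, to gauge what fits in
# on-chip SRAM. Layer/head counts are the published model config; the
# 8-bit cache precision is an assumption, not a Taalas specification.

LAYERS = 32     # transformer layers (the "32 etched layers" above)
KV_HEADS = 8    # grouped-query attention key/value heads
HEAD_DIM = 128  # dimension per head
BYTES = 1       # assume an 8-bit KV cache

# One key and one value vector per layer per KV head per token.
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # 64 KiB

print(f"KV cache per token: {kv_per_token / 2**10:.0f} KiB")
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7}-token context -> {ctx * kv_per_token / 2**20:.0f} MiB")
```

Even at 8 bits, a 4,096-token context costs ~256 MiB, which is why on-chip capacity, not compute, looks like the binding constraint in this design.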
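
As for the single-transistor claim itself, nothing public lets outsiders verify the circuit, but the arithmetic it must implement is simple to state. Here’s a minimal sketch assuming symmetric signed 4-bit weight quantization; it illustrates the math only, not Taalas’s analog implementation.

```python
# What a "4-bit multiply" computes: quantize each weight to a signed
# 4-bit integer plus a scale, then multiply activations by the
# dequantized value. Illustrates the arithmetic only; it says nothing
# about how a single-transistor analog circuit would realize it.

def quantize_4bit(w: float, scale: float) -> int:
    """Map a real-valued weight to a signed 4-bit integer in [-8, 7]."""
    return max(-8, min(7, round(w / scale)))

def dequant_multiply(activation: float, q: int, scale: float) -> float:
    """The multiply the hardware must perform: activation * (q * scale)."""
    return activation * q * scale

scale = 0.05                 # assumed per-tensor scale
w = 0.23                     # an example weight to be "etched"
q = quantize_4bit(w, scale)  # -> 5, i.e. 0.25 after dequantization

print(f"{q} {dequant_multiply(1.7, q, scale):.3f}")  # 5 0.425 (exact: 0.391)
```

The ~0.03 gap is ordinary quantization error; the “too noisy” objection is that analog device variation would stack additional error on top of that rounding, which is exactly the receipt commenters are asking for.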

Hottest takes

“too noisy and error prone to work” — rustyhancock
“Imagine a slot on your computer where you physically pop out and replace the chip” — owenpalmer
“Note that this doesn’t answer the question in the title” — rustybolt
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.