The path to ubiquitous AI (17k tokens/sec)

Instant AI at home? Community cheers speed, asks if it can scale

TLDR: Taalas unveiled a custom chip that runs a Llama 3.1 8B model at near-instant speeds, promising cheaper, faster AI at home. The community is thrilled by the speed but split on whether a small, fixed model can scale, stay upgradable, and compete with big cloud systems. If it can, this could be a game-changer.

Meet Taalas: the startup claiming it can turn any AI into a custom chip in two months, then blast out replies at up to 17k tokens per second (think: word-chunks so fast it feels instant). Their first “hard-wired” Llama 3.1 8B chip skips fancy cooling and merges memory with compute for speed and lower cost. The pitch: less data-center dystopia, more kitchen-counter AI. The vibe: the crowd is dazzled—and already arguing.
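To put that throughput claim in perspective, here is a back-of-the-envelope sketch. Only the 17,000 tokens/sec figure comes from Taalas's announcement; the ~100 tokens/sec cloud baseline is an illustrative assumption for comparison, not a measured number.

```python
# Back-of-the-envelope arithmetic only. 17,000 tok/s is Taalas's claimed
# rate for the HC1; the ~100 tok/s "typical cloud endpoint" baseline is an
# illustrative assumption.

TAALAS_TOKENS_PER_SEC = 17_000

def reply_latency_ms(reply_tokens: int,
                     tokens_per_sec: float = TAALAS_TOKENS_PER_SEC) -> float:
    """Milliseconds to stream out a reply of the given token count."""
    return reply_tokens / tokens_per_sec * 1_000

# A long, ~1,000-token answer at the claimed rate:
print(f"{reply_latency_ms(1_000):.0f} ms")        # 59 ms -- effectively instant
# The same reply at the assumed ~100 tok/s cloud baseline:
print(f"{reply_latency_ms(1_000, 100):.0f} ms")   # 10000 ms, i.e. ten seconds
```

That two-orders-of-magnitude gap is why testers describe large responses as arriving "instantly."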

One early tester gasped that replies arrived “instantly,” while another quipped we’re “one step closer” to buying a box of AIs on AliExpress. Meanwhile, skeptics are side-eyeing the fine print: an 8B model is fast and useful, but it’s tiny compared to big cloud brains. And if the chip is “weights as ROM” (basically fixed to one model), what happens when the next hot model drops? The community splits along those lines: speed-drunk fans want home coding boxes and “frontier model” chips, while pragmatists warn about scaling, flexibility, and upgrade pain. The memes are flowing—“instant noodles, instant AI,” “GPU who?”—but the debate is real: are we seeing the smartphone moment of AI hardware, or just a flashy pit stop on the road to bigger, smarter models?

Key Points

  • Taalas claims to convert previously unseen AI models into custom silicon in about two months.
  • The company’s per‑model “Hardcore Models” aim for order‑of‑magnitude gains in speed, cost, and power versus software implementations.
  • Core principles: total specialization, merging storage and computation on-chip at DRAM-level density, and radical simplification of the hardware stack.
  • Taalas says its architecture removes reliance on HBM, advanced packaging, 3D stacking, high-speed I/O, and liquid cooling, reducing system cost and complexity.
  • First product announced: HC1, hard-wired with Llama 3.1 8B; the 17k tokens/sec figure in the title refers to this chip.

Hottest takes

“one step closer to being able to purchase a box of llms on aliexpress” — baq
“jarring to see a large response come back instantly” — metabrew
“some kind of weights as ROM type of thing?” — impossiblefork
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.