Mercury 2: The fastest reasoning LLM, powered by diffusion

Mercury 2 promises instant AI replies, but can anyone actually feel the speed?

TLDR: Mercury 2 claims to be the fastest reasoning AI, using a parallel "diffusion" approach to blast out answers at over 1,000 tokens per second. The community is split: some cheer real-time agents and voice apps, while others demand proof that the speed improves outcomes and gripe that the demo feels stuck in a queue.

Mercury 2 just crashed the party claiming the world's fastest "thinking" AI, boasting output north of 1,000 tokens per second and a new diffusion trick that drafts and refines in parallel instead of pecking out one word at a time. Translation: less typewriter, more instant editor. The company touts real-time voice agents, coding assistants, and jam-free workflows, plus wallet-friendly pricing, but the crowd isn't buying the hype without receipts. The hottest take? "Intelligence per second." One commenter wants a scoreboard that blends smarts and speed, turning model choice into a race, not just a beauty contest. Another is thrilled that this could enable multi-shot prompts and gentle "nudges" without users noticing, potentially taming those weird AI hallucinations. But the vibe gets messy: skeptics ask what actually changes beyond shorter waits, especially for coding, where quality still bottlenecks outcomes. And the demo? A queue turned into the villain. Users joked that you can't feel "instant" if you're stuck in line: fastest model, slowest lobby. Memes flew: "Fast enough to beat autocorrect," "Speedrun to wrong answers," and "The world's quickest 'hold please.'" Meanwhile, fans say voice interfaces and multi-step AI agents are where speed becomes magic. The verdict: big claims, bigger expectations, and a community that wants proof under load, not press quotes.

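To make the "instant editor" framing concrete, here is a toy Python sketch of the control-flow difference. It has nothing to do with Inception's actual model or sampler (the vocabulary, pass count, and random choices are all made up); it only shows why sequential steps scale with output length for an autoregressive decoder but with the number of refinement passes for a diffusion-style one.

    # Toy sketch (not Inception's real algorithm): an autoregressive decoder
    # spends one sequential step per token, while a diffusion-style decoder
    # refines a whole masked draft in parallel, so its sequential step count
    # tracks the number of refinement passes, not the output length.
    import random

    VOCAB = ["the", "fast", "model", "drafts", "and", "refines", "answers"]

    def autoregressive_decode(length):
        out = []
        for _ in range(length):               # one sequential step per token
            out.append(random.choice(VOCAB))  # stand-in for next-token prediction
        return out, length                    # (tokens, sequential steps)

    def diffusion_decode(length, passes=3):
        draft = ["<mask>"] * length           # start from a fully masked draft
        for _ in range(passes):               # each pass updates every position at once
            draft = [random.choice(VOCAB) for _ in draft]
        return draft, passes                  # (tokens, sequential steps)

    _, ar_steps = autoregressive_decode(100)
    _, df_steps = diffusion_decode(100)
    print(f"autoregressive: {ar_steps} sequential steps for 100 tokens")
    print(f"diffusion-style: {df_steps} sequential steps for 100 tokens")

Real diffusion language models score every position jointly and typically commit only high-confidence tokens on each pass; the toy above captures only the step-count asymmetry, which is the whole speed story.
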
Key Points

  • Inception launched Mercury 2, a diffusion-powered reasoning LLM using parallel refinement instead of autoregressive decoding.
  • The company claims >5x faster generation, reporting 1,009 tokens/sec on NVIDIA Blackwell GPUs.
  • Pricing is $0.25 per 1M input tokens and $0.75 per 1M output tokens (a quick cost sketch follows the list).
  • Features include tunable reasoning, a 128K context window, native tool use, and schema-aligned JSON output, with emphasis on low p95 latency under high concurrency (a hedged API sketch also appears below).
  • Use cases span developer tools, agentic workflows, advertising optimization, transcript cleanup, human-computer interaction (HCI), and voice interfaces; partners report significant latency gains.
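
For scale, a back-of-envelope cost calculation at those list prices (the workload numbers are invented for illustration):

    # Cost at Mercury 2's listed prices: $0.25 / 1M input, $0.75 / 1M output.
    INPUT_USD_PER_M = 0.25
    OUTPUT_USD_PER_M = 0.75

    def cost_usd(input_tokens, output_tokens):
        return (input_tokens / 1e6) * INPUT_USD_PER_M \
             + (output_tokens / 1e6) * OUTPUT_USD_PER_M

    # Hypothetical workload: 10,000 requests, 2K tokens in / 500 tokens out each.
    print(f"${cost_usd(10_000 * 2_000, 10_000 * 500):.2f}")  # $5.00 in + $3.75 out = $8.75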

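And for the JSON-output claim, a deliberately hedged call sketch: this digest doesn't quote Inception's docs, so the snippet assumes an OpenAI-compatible chat endpoint, and the base URL, model id, and JSON-mode support are placeholders rather than confirmed details.

    # Hedged sketch only: assumes an OpenAI-compatible endpoint. The base_url,
    # model id, and response_format support are assumptions, not documented facts.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example.com/v1",  # placeholder endpoint
        api_key="YOUR_API_KEY",
    )

    resp = client.chat.completions.create(
        model="mercury-2",  # placeholder model id
        messages=[{
            "role": "user",
            "content": 'Reply as JSON: {"city": string, "temp_c": number} for Paris.',
        }],
        response_format={"type": "json_object"},  # assumes JSON mode is available
    )
    print(resp.choices[0].message.content)
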
Hottest takes

“multi-shot prompting (+ nudging) and the user doesn't even feel it” — dvt
“You can't actually tell that it is fast at all” — ilaksh
“metric of intelligence per second” — cjbarber

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.