Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference

Blazing fast or more of the same? Users fight over price, privacy, and OpenRouter

TL;DR: IonRouter launched claiming super-fast, low-cost AI inference via a custom engine on Nvidia hardware. The crowd is split: some like the speed and the use cases, but many question the pricing clarity and the competitiveness versus OpenRouter and Fireworks, and demand a strong privacy policy, making trust and value the real battleground.

IonRouter burst onto Hacker News with big speed claims: its custom "IonAttention" engine is said to juggle multiple AI models on one Nvidia Grace Hopper chip and pump out text fast, with one benchmark showing 7,167 tokens per second versus ~3,000 from a top provider. But the crowd didn't just clap; they reached for popcorn. First skirmish: pricing clarity. reactordev joked they "panicked" at the phrase "per token" until it was clarified as per million tokens, while GodelNumbering begged the team to put the Models & Pricing page front and center: people want costs before hype.

Second skirmish: why pick IonRouter at all? Oras questioned why anyone would choose a single provider when OpenRouter already lets you use the same models from multiple vendors. IonRouter's pitch (bring your custom models, get no cold starts and real-time scaling) met a chorus of "cool... but what's the moat?"

Third skirmish: is it actually competitive? nylonstrung called it “trailing the pareto frontier” for speed and price, even after marketplace fees, while others lobbed privacy grenades: erichocean said without a Google Vertex-level policy, it’s a "non-starter."

Still, fans are curious about the deep dive and the flashy use cases—robots, surveillance, and AI video pipelines. The vibe? Speed flex vs wallet check, with a side of "show us the receipts."

Key Points

  • IonRouter introduces the IonAttention engine to multiplex multiple models on a single GPU with millisecond swaps and real-time traffic adaptation, optimized for NVIDIA Grace Hopper (GH200).
  • A single GH200 achieves 7,167 tok/s on Qwen2.5-7B with IonAttention, compared to ~3,000 tok/s from an unnamed top provider, per the site’s benchmark.
  • The platform supports custom and open-source models (including LoRAs) with dedicated GPU streams, no cold starts, and per-second billing.
  • Use cases highlighted include robotics perception, multi-camera surveillance, game asset generation, and AI video pipelines.
  • Pricing is per million tokens with no idle costs, listing models like GLM-5, Kimi-K2.5, MiniMax-M2.5, Qwen3.5-122B-A10B, and GPT-OSS-120B with stated throughput and rates.
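The "per token" panic above is easy to reproduce with quick arithmetic. A minimal sketch of why the unit matters, using made-up numbers (the rate and token count below are illustrative assumptions, not IonRouter's actual pricing):

```python
def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Cost of a request billed at a per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

# A 2,000-token completion at a hypothetical $0.50 per million tokens
# costs a tenth of a cent:
print(cost_usd(2_000, 0.50))  # 0.001

# Misread as $0.50 *per token*, the same request would appear to cost $1,000:
print(2_000 * 0.50)  # 1000.0
```

A six-order-of-magnitude gap from one dropped word, which is why commenters wanted the pricing page, with units, up front.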

Hottest takes

“Man you had me panicking there for a second. Per token?!?” — reactordev
“One thing I don’t get is why would anyone use a direct service that does the same thing as others when there are services such as openrouter” — Oras
“it seems like this is trailing the pareto frontier in cost and speed.” — nylonstrung
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.