November 28, 2025

Talk of the town or copycat sound?

Open (Apache 2.0) TTS model for streaming conversational audio in real time

Open-source talker drops—fans cheer, skeptics yell “Kyutai clone”

TLDR: Dia2 is a new open-source text-to-speech model that talks in real time, with free checkpoints in two sizes and English-only output capped at about two minutes. The community is split between cheering the release and calling it a Kyutai-style clone, demanding a clearer account of what's technically different while still piling into the demo to judge for themselves.

Dia2 just hit the stage as an Apache-licensed voice generator that can start talking as soon as you type a few words, and the crowd is split. On paper it’s spicy: real-time streaming speech, English only, a roughly two-minute limit, plus free checkpoints in two sizes and a quick demo on Hugging Face. But the comments? That’s where it’s boiling.

The top drama: originality. One camp is side-eyeing Nari Labs for saying they were “inspired by” other projects without spelling out the differences. “Vibe-coded clone” became the phrase of the day, as users point out that Dia2 leans on the same Mimi audio codec and similar building blocks they associate with Kyutai’s work. Another commenter flatly declares it “very similar,” stoking the copycat cries. Cue the memes: “Kyutai cosplay,” “We have Kyutai at home,” and “Depformer? Déjà-vu-former.”

Meanwhile, the hype squad is excited that it’s open, fast, and streamy. They’re into the idea of conditioning on example voices for steadier results, even if quality varies from generation to generation. The pragmatic crowd shrugs: the two-minute cap is fine for assistants, and the strict “don’t deepfake” disclaimer gets a thumbs-up, along with a few snarky “good luck policing that” replies. Bottom line: Dia2 may talk in real time, but the comments are screaming in real time too, demanding clearer tech receipts while still crowding the demo link to hear it speak.

Key Points

  • Nari Labs released Dia2, an open-source (Apache 2.0) streaming dialogue TTS model for real-time conversational audio.
  • Model checkpoints (1B and 2B) and inference code are available, with English-only output capped at about two minutes.
  • Audio conditioning is supported, and conditional generation using prefix audio (transcribed by Whisper) is recommended for stability; see the sketch after this list.
  • Setup requires uv and CUDA 12.8+ drivers; the CLI auto-selects CUDA/CPU, defaults to bfloat16, and offers a CUDA Graph option.
  • Upcoming components include a JAX-based Bonsai implementation, a Dia2 TTS Server for real streaming, and Sori, a Rust speech-to-speech engine.
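
For the curious, here is roughly what that conditioning flow looks like. This is a minimal sketch, assuming a hypothetical dia2 module, model id, and generate() signature (the real inference code ships with the checkpoints); only the PyTorch device/dtype selection and the Whisper calls are real APIs.

    # Minimal sketch of prefix-audio conditioning for Dia2-style generation.
    # Real APIs: torch device/dtype checks, openai-whisper transcription.
    # Placeholders: the dia2 module, the model id, and generate()'s signature.
    import torch
    import whisper  # pip install openai-whisper

    # Mirror the CLI behavior described above: auto-select CUDA/CPU, default bfloat16.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.bfloat16 if device == "cuda" else torch.float32

    # Transcribe the prefix clip so the model conditions on paired audio + text,
    # the setup the release recommends for stable output.
    asr = whisper.load_model("base")
    prefix_text = asr.transcribe("prefix_voice.wav")["text"]

    # Hypothetical Dia2 call; names, speaker tags, and arguments are illustrative only.
    # model = dia2.load("nari-labs/Dia2-2B", device=device, dtype=dtype)
    # audio = model.generate(
    #     text="[S1] Hey, did you try the demo? [S2] Just did. Not bad!",
    #     prefix_audio="prefix_voice.wav",
    #     prefix_text=prefix_text,  # Whisper transcript keeps audio and text aligned
    # )

The logic behind the recommendation: a matched audio-and-transcript prefix anchors the voice, which is why conditional generation is pitched as the steadier path when quality varies from run to run.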

Hottest takes

“vibe-coded clones to get some publicity” — ks2048
“Looks very similar to Kyutai’s models” — woodson
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.