Voxtral Transcribe 2

Open weights, closed doors? Voxtral’s new transcriber ignites paywall beef and benchmark brawls

TLDR: Mistral launched Voxtral Transcribe 2, a pair of speech-to-text models: an open-weights realtime model and a batch model priced at $0.003/min through the API. Commenters love the price and privacy angle but flame a “pay-to-try” studio, missing realtime speaker labels, and a lack of comparisons to Whisper, turning a tech drop into a community slap-fight.

Mistral just dropped Voxtral Transcribe 2: two speech-to-text models that promise speedy captions, speaker labels, and rock-bottom prices. The headline grabber: Voxtral Realtime is open-weights under Apache 2.0, streams in near real time, and speaks 13 languages. The batch model, Mini Transcribe V2, boasts $0.003 per minute and a low word error rate (think “how often it mishears you”). On paper, it’s a win for wallets and privacy.

But the comments? Chaos and comedy. One user got hyped for “native diarization” (that’s the “who said what” labels), then slammed the brakes: no diarization in Realtime, only in the batch model—cue the “open weights, closed features” memes, and a link to the 9GB model on Hugging Face. Another torched the “Click to try!” promo that allegedly leads to a paywall, fuming that “try” actually means “buy.” Meanwhile, benchmark detectives demanded receipts: why compare against GPT-4o mini and not the fan-favorite open model Whisper Large v3?

Still, number-crunchers cheered the price war, pointing at Amazon’s $0.024/min vs Voxtral’s $0.003/min, an 8x gap. Self-hosters loved the open weights for private, on-device use; subscription skeptics joked about “10-year plans” and “free as in free-ish.” Translation: it’s half standing ovation, half side-eye, and 100% drama.
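For anyone who wants to check the number-crunchers' math, here is a minimal sketch of the price comparison. The two per-minute prices come from the thread; the 100-hour workload and the function name `transcription_cost` are made up for illustration, and real bills may include tiers or minimums this ignores.

```python
# Per-minute prices in USD, as quoted in the comment thread.
PRICE_PER_MIN = {
    "Voxtral Mini Transcribe V2": 0.003,
    "Amazon": 0.024,
}


def transcription_cost(hours: float, price_per_min: float) -> float:
    """Return the cost in USD of transcribing `hours` of audio."""
    return hours * 60 * price_per_min


if __name__ == "__main__":
    hours = 100  # hypothetical monthly workload, not from the post
    for name, price in PRICE_PER_MIN.items():
        print(f"{name}: ${transcription_cost(hours, price):.2f} for {hours} h")

    ratio = PRICE_PER_MIN["Amazon"] / PRICE_PER_MIN["Voxtral Mini Transcribe V2"]
    print(f"Price ratio: about {ratio:.0f}x")
```

At 100 hours of audio, that works out to roughly $18 versus $144.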

Key Points

  • Voxtral Transcribe 2 introduces Voxtral Mini Transcribe V2 (batch) and Voxtral Realtime (live) with state-of-the-art accuracy, diarization, and multilingual support.
  • Voxtral Realtime uses a novel streaming architecture offering latency configurable down to sub-200ms; at 2.4s delay it matches the batch model, and at 480ms it stays within 1–2% word error rate (WER).
  • Voxtral Mini Transcribe V2 reports ~4% WER on FLEURS and pricing at $0.003/min, claiming best price-performance among transcription APIs.
  • Mini V2 outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy; it is ~3× faster than ElevenLabs’ Scribe v2 at one-fifth the cost.
  • An audio playground in Mistral Studio supports diarization, timestamp control, context biasing, and common audio formats up to 1GB per file.
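For readers wondering what the WER figures in the points above actually measure: word error rate is usually computed as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic illustration of that formula, not the scoring code behind any of the reported numbers; published benchmarks typically also normalize casing and punctuation first, which this skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

A score of about 4% therefore means roughly one word in 25 is wrong, missing, or extra relative to the reference.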

Hottest takes

“or not, no diarization in real-time” — observationist
“So, you don’t mean ‘try this out,’ you mean ‘buy this product’” — serf
“There’s no comparison to Whisper Large v3” — mdrzn