February 4, 2026
Open source or open sore?
Voxtral Transcribe 2
Open weights, closed doors? Voxtral’s new transcriber ignites paywall beef and benchmark brawls
TLDR: Voxtral launched two speech-to-text models, including an open-weights realtime version and a cheap $0.003/min API. Commenters love the price and privacy angle but flame a “pay-to-try” studio, missing realtime speaker labels, and a lack of comparisons to Whisper—turning a tech drop into a community slap-fight.
Voxtral just dropped Transcribe 2—two speech-to-text models that promise speedy captions, speaker labels, and rock-bottom prices. The headline grabber: Voxtral Realtime is open-weights under Apache 2.0, streams in near real time, and speaks 13 languages. The batch model, Mini Transcribe V2, brags $0.003 per minute and low word errors (think “how often it mishears you”). On paper, it’s a win for wallets and privacy.
But the comments? Chaos and comedy. One user got hyped for “native diarization” (that’s the “who said what” labels), then slammed the brakes: no diarization in Realtime, only in the batch model—cue the “open weights, closed features” memes, and a link to the 9GB model on Hugging Face. Another torched the “Click to try!” promo that allegedly leads to a paywall, fuming that “try” actually means “buy.” Meanwhile, benchmark detectives demanded receipts: why compare against GPT-4o mini and not the fan-favorite open model Whisper Large v3?
Still, number-crunchers cheered the price war, pointing at Amazon’s $0.024/min vs Voxtral’s $0.003/min. Self-hosters loved the open weights for private, on-device use; subscription skeptics joked about “10-year plans” and “free as in free-ish.” Translation: it’s half standing ovation, half side-eye, and 100% drama.
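For anyone who wants to check the price-war math, here is a minimal sketch of the comparison. The two per-minute prices are the ones quoted in the comments; the 100-hour workload and the names `PRICE_PER_MIN` and `cost_usd` are made up for illustration.

```python
# Per-minute prices (USD) as quoted by commenters in the thread.
PRICE_PER_MIN = {
    "Amazon": 0.024,
    "Voxtral Mini Transcribe V2": 0.003,
}


def cost_usd(minutes: float, price_per_min: float) -> float:
    """Return the transcription cost for a given number of audio minutes."""
    return minutes * price_per_min


# Hypothetical workload: 100 hours of audio (an assumption, not from the thread).
workload_minutes = 100 * 60

costs = {name: cost_usd(workload_minutes, price) for name, price in PRICE_PER_MIN.items()}
price_ratio = PRICE_PER_MIN["Amazon"] / PRICE_PER_MIN["Voxtral Mini Transcribe V2"]

for name, total in costs.items():
    print(f"{name}: ${total:.2f} for {workload_minutes} minutes")
print(f"Price ratio: about {price_ratio:.0f}x")
```

At those list prices the gap is roughly 8x, before any volume discounts, free tiers, or self-hosting costs that the thread does not quantify.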
Key Points
- Voxtral Transcribe 2 introduces Voxtral Mini Transcribe V2 (batch) and Voxtral Realtime (live) with state-of-the-art accuracy, diarization, and multilingual support.
- Voxtral Realtime uses a novel streaming architecture offering latency configurable down to sub-200ms; at 2.4s delay it matches the batch model, and at 480ms it stays within 1–2% WER.
- Voxtral Mini Transcribe V2 reports ~4% WER on FLEURS and pricing at $0.003/min, claiming best price-performance among transcription APIs.
- Mini V2 outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy; it is ~3× faster than ElevenLabs’ Scribe v2 at one-fifth the cost.
- An audio playground in Mistral Studio supports diarization, timestamp control, context biasing, and common audio formats up to 1GB per file.
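Since WER (word error rate) carries most of the accuracy claims above, here is a minimal sketch of how the metric is commonly computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. This is the textbook definition, not Voxtral's evaluation code; real benchmarks also normalize casing and punctuation first, which this sketch skips. The function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {score:.1%}")
```

By this measure, a ~4% WER means roughly four word-level mistakes per hundred reference words. Because insertions count as errors, WER can exceed 100% on very noisy output.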