Voxtral Transcribe 2

Open weights, closed doors? Voxtral’s new transcriber ignites paywall beef and benchmark brawls

TLDR: Mistral launched two Voxtral speech-to-text models: an open-weights realtime version and a batch model with a cheap $0.003/min API. Commenters love the price and privacy angle but flame a “pay-to-try” studio, missing realtime speaker labels, and a lack of comparisons to Whisper, turning a tech drop into a community slap-fight.

Mistral just dropped Voxtral Transcribe 2: two speech-to-text models that promise speedy captions, speaker labels, and rock-bottom prices. The headline grabber: Voxtral Realtime is open-weights under Apache 2.0, streams in near real time, and speaks 13 languages. The batch model, Mini Transcribe V2, brags $0.003 per minute and a low word error rate (think “how often it mishears you”). On paper, it’s a win for wallets and privacy.

But the comments? Chaos and comedy. One user got hyped for “native diarization” (that’s the “who said what” labels), then slammed the brakes: no diarization in Realtime, only in the batch model—cue the “open weights, closed features” memes, and a link to the 9GB model on Hugging Face. Another torched the “Click to try!” promo that allegedly leads to a paywall, fuming that “try” actually means “buy.” Meanwhile, benchmark detectives demanded receipts: why compare against GPT-4o mini and not the fan-favorite open model Whisper Large v3?

Still, number-crunchers cheered the price war, pointing at Amazon’s $0.024/min vs Voxtral’s $0.003/min, an eightfold gap. Self-hosters loved the open weights for private, on-device use; subscription skeptics joked about “10-year plans” and “free as in free-ish.” Translation: it’s half standing ovation, half side-eye, and 100% drama.
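For anyone who wants to run the commenters' math themselves, here is a minimal sketch of the price comparison. It uses only the two per-minute figures quoted above; the provider keys, the function names, and the 1,000-hour workload are made up for illustration, and real bills may include tiers or minimums the thread does not mention.

```python
from decimal import Decimal

# Per-minute prices in USD, exactly as quoted in the article and comments.
# The dictionary keys are illustrative labels, not official product IDs.
PRICE_PER_MINUTE = {
    "voxtral_mini_transcribe_v2": Decimal("0.003"),
    "amazon": Decimal("0.024"),
}


def transcription_cost(minutes: int, provider: str) -> Decimal:
    """Return the flat-rate cost of transcribing `minutes` of audio."""
    return PRICE_PER_MINUTE[provider] * minutes


def price_ratio(expensive: str, cheap: str) -> Decimal:
    """Return how many times pricier `expensive` is than `cheap`."""
    return PRICE_PER_MINUTE[expensive] / PRICE_PER_MINUTE[cheap]


if __name__ == "__main__":
    minutes = 1000 * 60  # hypothetical workload: 1,000 hours of audio
    for provider in PRICE_PER_MINUTE:
        print(provider, transcription_cost(minutes, provider))
    print("ratio:", price_ratio("amazon", "voxtral_mini_transcribe_v2"))
```

Decimal is used instead of float so that the tiny per-minute prices multiply and divide without rounding noise.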

Key Points

• Voxtral Transcribe 2 introduces Voxtral Mini Transcribe V2 (batch) and Voxtral Realtime (live), touting state-of-the-art accuracy, multilingual support, and diarization (which commenters note is batch-only).
• Voxtral Realtime uses a novel streaming architecture with latency configurable down to sub-200ms; at a 2.4s delay it matches the batch model, and at 480ms it stays within 1–2% WER of it.
• Voxtral Mini Transcribe V2 reports ~4% WER (word error rate) on FLEURS and pricing at $0.003/min, claiming the best price-performance among transcription APIs.
• Mini V2 is reported to outperform GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy, and to run ~3× faster than ElevenLabs’ Scribe v2 at one-fifth the cost.
• An audio playground in Mistral Studio supports diarization, timestamp control, context biasing, and common audio formats up to 1GB per file.
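Since WER carries most of the accuracy claims above, a rough sketch of how the metric is commonly computed may help: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is the textbook definition, not Voxtral's evaluation code; the function name and the lowercase-and-split normalization are assumptions, and published benchmarks usually apply heavier text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

By this definition, a ~4% WER means roughly four word-level mistakes per hundred reference words.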

Hottest takes

“or not, no diarization in real-time” — observationist

“So, you don’t mean ‘try this out,’ you mean ‘buy this product’” — serf

“There’s no comparison to Whisper Large v3” — mdrzn

February 4, 2026

Open source or open sore?
