Cohere Transcribe: Speech Recognition

Cohere drops open speech‑to‑text; fans hype, skeptics yell “OCR déjà vu”

TLDR: Cohere released an Apache 2.0, downloadable speech‑to‑text model that tops public accuracy charts. Fans love the licensing and performance, while skeptics argue over whether niche speech tools will be crushed by bigger AI models—and whether “open source” means code, weights, or both.

Cohere just launched Transcribe, a speech‑to‑text model that turns audio into words and claims the #1 spot for English accuracy on Hugging Face’s Open ASR Leaderboard. It’s Apache 2.0 (business‑friendly) and downloadable, with a managed option via Cohere’s Model Vault. That’s the headline, but the comments are where it gets spicy.

One camp is cheering. A longtime user gushed about Cohere’s reliability, praising their “crisp, steady” performance—translation: fast and consistent, which matters when your meeting notes and call centers depend on it. Another big win in the thread: licensing. As one popular voice put it, it’s “great” this isn’t restricted to non‑commercial use. Devs heard “Apache 2.0” and started doing cartwheels.

Then came the drama. A skeptic dropped a grenade: what if specialized speech tools get swallowed by bigger, smarter all‑in‑one AIs, just as OCR (reading scanned text) was? Cue a mini culture war: specialists vs. generalists. Meanwhile, a sharp question cut through the party: if it’s “open source,” where’s the source? Is posting model weights enough? The comment section split into dictionary lawyers and pragmatists who just want great transcripts.

There’s also a side‑quest: language coverage. It handles 14 languages today, but European commenters asked how hard it is to add more. Bottom line: top accuracy, a permissive license, and a community arguing about the future of speech tech, with memes about “Whisper wars” and “OCR flashbacks” sprinkled in for flavor.

Key Points

  • Cohere released Transcribe, an open-weights ASR model under Apache 2.0, available for download and via Model Vault.
  • The model ranks #1 on Hugging Face’s Open ASR Leaderboard for English with a 5.42% average WER, outperforming several open- and closed-source ASR systems.
  • Architecture: conformer-based encoder with a lightweight Transformer decoder; input is log-Mel spectrograms from audio waveforms.
  • Trained from scratch using supervised cross-entropy on output tokens; designed for production with a manageable inference footprint and serving efficiency.
  • Supports 14 languages across European, APAC, and MENA regions, and aims to meet strict production latency and throughput requirements.
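
For readers wondering what a figure like 5.42% WER measures: word error rate is usually defined as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that general definition, not the leaderboard's actual scoring code. The function name word_error_rate and the lowercase-and-split normalization are assumptions made for the example; real evaluations typically apply more careful text normalization and average over several test sets.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Normalization here (lowercase + whitespace split) is an assumption for
    illustration; benchmark scoring usually normalizes text more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

On this definition, a 5.42% WER corresponds to roughly one word error for every 18 reference words, assuming errors are counted as above.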

Hottest takes

“My worry is that ASR will end up like OCR.” — dinakernel
“It’s great that this is Apache 2.0 licensed” — simonw
“If this is ‘open source’ is there source code somewhere?” — teach