Parakeet.cpp – Parakeet ASR inference in pure C++ with Metal GPU acceleration

C++ speed on Mac mics: fans cheer, rivals boast, and a Qwen vs Parakeet showdown erupts

TLDR: Parakeet.cpp brings blazing local speech-to-text in pure C++ with Apple GPU acceleration, touting fast results and a full feature lineup. Comments erupted into a Parakeet vs Qwen rivalry, GUI wishlists, and “how does it compare?” debates, pitting raw speed claims against real-world, plug-and-play expectations.

Parakeet.cpp just swooped in with a bold flex: lightning‑fast, no‑Python speech‑to‑text that runs locally and loves Apple’s GPU. The dev claims the tiny C++ engine hits a wild “~27ms” encoder pass for 10 seconds of audio and supports everything from offline captioning to live transcription and even who‑spoke‑when detection. In the comments, creator noahkay13 drops a victory lap — “runs 7 model families” — and the local‑first crowd goes feral for the “just C++” purity. Bye bloated runtimes, hello Metal magic.

But the thread quickly morphs into feature‑creep meets turf war. One camp wants polish: ghostpepper parachutes in with a shiny GUI plug for Scriberr, basically saying, “cool engine, where’s my big green button?” Another camp wants receipts: nullandvoid, impressed with other Parakeet setups on Windows and Mac, asks the eternal question — “Hoe does this compare?” — and the typo becomes an instant meme about lawn‑care benchmarks.

Then a boss enters: antirez (yes, that antirez) links qwen-asr and voxtral.c, bragging Qwen can transcribe live radio on “any random laptop.” Suddenly it’s Parakeet vs Qwen, Metal vs everywhere, benchmarks vs “does it actually help me?” The vibe: massive hype for speed and simplicity, spicy rivalry over real‑world wins, and a chorus of devs asking for clean, plug‑and‑play tools — with jokes flying faster than the transcripts.

Key Points

  • Parakeet.cpp provides pure C++ inference for NVIDIA Parakeet ASR models using the Axiom tensor library with automatic Apple Metal GPU acceleration.
  • Reported performance includes ~27 ms encoder inference for 10 seconds of audio on Apple Silicon (110M model), claimed as 96x faster than CPU.
  • Supported models cover offline (TDT-CTC 110M English, TDT 600M multilingual), streaming (EOU 120M RNNT with end-of-utterance detection, Nemotron 600M multilingual with 80–1120 ms latency), and diarization (Sortformer 117M, up to 4 speakers).
  • The shared pipeline converts 16 kHz mono WAV audio to an 80-bin Mel spectrogram, feeds it to a FastConformer encoder, and decodes with either CTC (greedy) or TDT (the default, with higher accuracy).
  • APIs include high-level classes for offline/streaming transcription and diarization, support for word-level timestamps, microphone streaming, and streaming diarization with arrival-order speaker tracking.

Hottest takes

"Runs 7 model families" — noahkay13
"Hoe does this compare?" — nullandvoid
"Qwen-asr can easily transcribe live radio" — antirez
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.