Pure C, CPU-only inference with Mistral's Voxtral Realtime 4B speech-to-text model

A no-frills speech-to-text tool drops: Mac folks cheer, Linux users hack around it, Rust fans roll in

TLDR: A no-dependency C tool brings Mistral’s 4B speech model to everyday machines, streaming words as they’re heard. Mac users cheer while Linux folks wrestle with mic capture, and the Rust vs C rivalry heats up; the crowd wants better docs and a path to training on tricky dialects.

Move over, Python: a pure C, CPU-only speech-to-text engine has landed, promising word-by-word streaming without big toolkits. The dev even throws shade at Mistral’s vLLM tie-up, shipping a simple Python reference alongside the no-dependency C engine. The crowd? Split. One Linux user gushes that install was “a breeze,” then slams into the wall: realtime mic transcription isn’t working on Linux, because the --from-mic flag is Mac-only. People are duct-taping ffmpeg pipes like it’s 2010 and posting command-line incantations. Cue the meme: “press F for ffmpeg.”

Meanwhile, a learner begs for a starter path to training on dialects and jargon, turning the thread into a guidance counselor’s office. Others obsess over speed: Apple’s Metal GPU backend (MPS) flies, the BLAS path on Linux trudges because BF16 weights have to be converted to FP32, and everyone keeps repeating that the model is a chunky ~8.9 GB. The best subplot? Rust vs C: a Rust runtime is neck and neck with this release on the front page, sparking jokes about an “audio runtime rumble.” Love it or roll your eyes, the vibe is clear: open weights, open drama, and a community that wants mic streaming on Linux, real docs, and something that can survive very long recordings without melting. For now, tokens stream to stdout, the encoder chunks audio to cap memory, and the KV cache rolls like a treadmill, but users want the mic to actually work without a ritual.
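
For the curious, here is roughly what a "treadmill" KV cache can look like. This is a minimal sketch, not the project's actual code: the `ring_cache` type and the `ring_*` functions are hypothetical names, and the real engine may well roll its cache differently (for example by shifting entries rather than wrapping an index). The idea is just that only the most recent `window` positions stay in memory, so very long recordings do not grow the cache without bound.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ring-buffer cache: keeps only the most recent `window` rows. */
typedef struct {
    float *data;    /* window * dim floats, storage owned by the caller */
    size_t window;  /* max positions retained (the digest cites 8,192) */
    size_t dim;     /* floats stored per position */
    size_t total;   /* positions appended so far, including evicted ones */
} ring_cache;

static void ring_init(ring_cache *c, float *storage, size_t window, size_t dim)
{
    assert(c != NULL && storage != NULL && window > 0 && dim > 0);
    c->data = storage;
    c->window = window;
    c->dim = dim;
    c->total = 0;
}

/* Append one row; once the window is full this overwrites the oldest row. */
static void ring_push(ring_cache *c, const float *row)
{
    size_t slot = c->total % c->window;
    memcpy(c->data + slot * c->dim, row, c->dim * sizeof(float));
    c->total++;
}

/* Number of positions currently retained. */
static size_t ring_len(const ring_cache *c)
{
    return c->total < c->window ? c->total : c->window;
}

/* Row for the i-th retained position, where 0 is the oldest one kept. */
static const float *ring_get(const ring_cache *c, size_t i)
{
    assert(i < ring_len(c));
    size_t first = c->total < c->window ? 0 : c->total % c->window;
    return c->data + ((first + i) % c->window) * c->dim;
}
```

With a window of 3 and five rows pushed, the cache holds rows 3, 4 and 5: memory stays fixed no matter how long the audio runs, at the cost of forgetting anything older than the window.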

Key Points

  • A pure C implementation provides inference for Mistral AI’s Voxtral Realtime 4B model with zero external dependencies.
  • Supports Metal (MPS) GPU acceleration on Apple Silicon and a BLAS backend on Intel Mac/Linux; the BLAS path is slower because BF16 weights are converted to FP32.
  • Audio handling includes chunked encoding with overlapping windows, streaming from stdin, and a streaming C API for incremental processing.
  • Weights are memory-mapped from safetensors (BF16) for near-instant loading; supports WAV input with auto-resampling to 16 kHz.
  • A simple Python reference implementation is included; the project is lightly tested and seeks validation on long transcriptions to stress the rolling KV cache (sliding window of 8,192).
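
The BF16-to-FP32 conversion mentioned above is simple per value, which is why the cost on the BLAS path comes from doing it across billions of weights. The sketch below is an illustration under stated assumptions, not the project's code: `bf16_to_f32` and `bf16_row_to_f32` are hypothetical helper names. It relies only on the fact that a BF16 value is the upper 16 bits of an IEEE-754 single-precision float.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value to FP32. BF16 is the top half of a binary32,
   so the conversion is a 16-bit left shift with the low bits zeroed. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);  /* reinterpret the bits without aliasing issues */
    return f;
}

/* Convert a run of BF16 weights into a caller-provided FP32 buffer,
   for example before passing a matrix to a float-only BLAS routine. */
static void bf16_row_to_f32(const uint16_t *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(src[i]);
}
```

The widened copy takes twice the memory of the BF16 original, so converting a bounded number of rows at a time is one plausible way to keep the footprint in check.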

Hottest takes

"This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet" — Curiositry
"but like tricky dialects and use of various terminologies but I'm still confused as to where to start" — sgt
"Funny, this and the Rust runtime implementation are neck and neck on the frontpage right now" — written-beyond
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.