March 5, 2026

Talk is cheap—’til your Mac talks back

Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift

Apple Macs get a chatty voice: commenters say it's just a demo, not a Siri killer

TLDR: Nvidia’s PersonaPlex 7B brings voice-in, voice-out chat to Apple Macs with no text step in between, but today it’s a demo that processes a pre-recorded clip rather than a live conversation. Commenters clash over whether it’s genuinely new and how risky it is, with calls for real-time conversation, tool use, and stronger safety controls.

Nvidia’s PersonaPlex 7B just taught Apple Macs to hold a two‑way voice chat: it listens and talks at the same time, no text in between. The Swift library collapses the usual speech‑to‑text → thinking → text‑to‑speech pipeline into one model, aiming for faster, more natural replies. It builds on Kyutai’s Moshi voice tech and ships as a converted, Apple‑friendly version of the original model on Hugging Face. Think of it as a “single brain” that hears you and answers back, trying to keep tone and emotion intact. Techies cheered the clever compression and codec reuse, but that’s where the applause stopped.

Commenters quickly split into camps. The hype‑curious crowd said “cool tech” and immediately asked for different compression levels—one user joked they’d “8‑bit” it before lunch. The skeptics pounced: “already used by the big players,” claimed one, saying this isn’t new and conversation modes have done it for ages. The mood soured when a download‑and‑try fan reported it’s not real‑time chat yet—just a proof of concept that processes a recorded clip. Meanwhile, wishlist warriors demanded a true “do‑everything” assistant that uses text tools while talking, and the safety squad waved red flags, linking a cautionary news story and warning about voice models gone rogue. Meme of the day: “Full‑duplex flex vs WAV‑only club.”

Key Points

  • A Swift/MLX library runs NVIDIA’s PersonaPlex 7B on Apple Silicon for full‑duplex speech‑to‑speech without text intermediates.
  • The model processes 17 parallel token streams at 12.5 Hz and is based on Kyutai’s Moshi architecture, with 18 voice presets and role prompts.
  • A conversion script transforms the 16.7 GB PyTorch checkpoint into 4‑bit MLX safetensors for both the 7B temporal transformer and Depformer.
  • The audio pipeline uses Kyutai’s Mimi codec (SEANet, streaming convs, 8‑layer transformer bottleneck, Split RVQ) reused from prior TTS work.
  • The Depformer employs a MultiLinear sliced‑weight pattern; 4‑bit quantization reduces its memory from ~2.4 GB to ~650 MB (~3.7×) with no reported quality loss.
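The arithmetic behind those last two bullets is easy to sanity-check. The sketch below is not code from the library; the group size and per-group overhead are assumptions about a typical MLX-style 4-bit scheme, used only to show why ~2.4 GB of fp16 weights lands within rounding distance of the quoted ~650 MB / ~3.7× figures.

```python
# Rough check of the Key Points arithmetic (all figures approximate).

# Token throughput: 17 parallel streams sampled at 12.5 Hz.
tokens_per_second = 17 * 12.5          # 212.5 tokens generated per second

# Depformer memory: ~2.4 GB of fp16 weights implies ~1.2B parameters.
fp16_bytes = 2.4e9
n_params = fp16_bytes / 2              # 2 bytes per fp16 weight

# 4-bit packed weights, plus a per-group fp16 scale and bias
# (group size 64 is an assumption, not a documented detail).
GROUP_SIZE = 64
q4_bytes = n_params * 4 / 8                   # 0.5 bytes per weight
overhead = (n_params / GROUP_SIZE) * 2 * 2    # scale + bias, 2 bytes each
total_q4 = q4_bytes + overhead

print(f"{tokens_per_second} tokens/s")
print(f"4-bit Depformer: ~{total_q4 / 1e6:.0f} MB")   # ~675 MB
print(f"reduction: ~{fp16_bytes / total_q4:.1f}x")    # ~3.6x
```

Under these assumptions the estimate comes out to roughly 675 MB and a ~3.6× reduction, consistent with the article's ~650 MB and ~3.7× once you account for rounding and whatever exact group size and metadata layout the conversion script actually uses.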

Hottest takes

"already for quite a long time being used by the big players" — WeaselsWin
"I'd skip this for now - it does not allow any kind of interactive conversation" — vessenes
"This sounds quite dangerous" — 4dregress
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.