Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

Laptop-ready voice cloning thrills fans while audiophiles want cleaner sound

TLDR: Sopro TTS clones voices from short clips and runs on regular laptops—no graphics card needed. The crowd loves the accessibility but argues over audio quality, with some hearing a “warble” and others pointing to Chatterbox for cleaner (but slower) sound, setting up a speed vs. fidelity showdown.

A scrappy side‑project just crash‑landed into the TTS world, and the comments are having a party. Sopro TTS is a lightweight English text‑to‑speech (TTS) model that can clone a voice from just a few seconds of audio and runs on your computer’s main processor (CPU), with no fancy graphics card needed. It even streams speech and, on an M3 CPU, churns out around 30 seconds of audio in about 7.5 seconds. The creator admits it’s not top‑of‑the‑line, but the community loves the hustle and the accessibility, with one user calling it “Mission impossible cloning skills” and another planning to use it for alerts on hardware without GPUs. Translation: it works on office PCs and budget boxes, and people are hyped.

Then the drama hits: sound‑snobs say they hear a “warble” on long vowels and want a bigger, cleaner version. A mini‑rivalry erupts as folks keep pointing to Chatterbox‑TTS‑Server—slower, but “quite high quality”—as the benchmark. Meanwhile, the cheer squad argues Sopro’s speed and ease beat perfection for everyday uses. The thread swings between “wow, it runs on a laptop!” and “give us studio‑grade polish,” with one commenter dropping a vintage rhyme and a YouTube throwback for meme points. Verdict: a budget hero that’s sparking a tasty fight between speed and fidelity, and everyone wants to see the next, cleaner version.

Key Points

  • Sopro TTS is a 169M-parameter English text-to-speech model using dilated convolutions and lightweight cross-attention instead of Transformers.
  • The model supports streaming synthesis and zero-shot voice cloning, and reaches a real-time factor (RTF) of ~0.25 on an M3 CPU (30 s of audio in 7.5 s).
  • Voice cloning works with 3–12 seconds of reference audio; quality can vary with mic and ambient conditions.
  • Installation is available via PyPI or GitHub, with CLI and Python APIs; an interactive streaming demo runs via Uvicorn or Docker.
  • Training used pre-tokenized data, and the raw audio was discarded for budget reasons, which may limit speaker embedding fidelity. Non-streaming mode gives the best quality.
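
For readers unfamiliar with the real-time factor (RTF) quoted above, here is a minimal sketch of the arithmetic behind the ~0.25 figure. The function name `real_time_factor` is hypothetical and is not part of Sopro's API. It only divides synthesis time by the duration of the audio produced, using the 7.5 s and 30 s figures from the key points.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: seconds of compute spent per second of audio produced.

    Hypothetical helper for illustration only; not part of Sopro's API.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures reported for an M3 CPU: 30 s of audio generated in 7.5 s.
rtf = real_time_factor(7.5, 30.0)
speedup = 1.0 / rtf  # how many times faster than real-time playback

print(f"RTF = {rtf:.2f}, about {speedup:.0f}x faster than real time")
```

An RTF below 1.0 means audio is generated faster than it plays back, which is what makes streaming on a CPU practical.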

Hottest takes

"Has a slight warble in the voice on long vowels" — convivialdingo
"the best alternative is Chatterbox-TTS-Server" — realityfactchex
"Mission impossible cloning skills without the long compile time" — blitzar