Show HN: I trained a 9M speech model to fix my Mandarin tones

Tiny tone tutor melts HN: shy learners rejoice, Farsi fans rally, cynics fear a sellout

TL;DR: A developer built a tiny on-device tool that grades Mandarin tones, giving learners precise feedback. HN loves the confidence boost and wants other languages like Farsi, while a meta debate flares over AI philosophy and some fear the project could “sell out” if it grows.

Hacker News is swooning over a DIY Mandarin “tone cop” you can try here. The creator trained a tiny speech model to grade your tones instead of autocorrecting them, and commenters are treating it like a pocket coach for anyone too shy to practice with humans. One user gushes that it’s “instantly awesome” and a confidence boost, calling it the perfect compromise for anxious speakers.

Then the chorus starts: “Do more languages!” A top request is Persian/Farsi/Dari, with one commenter lamenting how overlooked it is for learners and begging for the same treatment. Others bring culture and comedy: a Taiwan alum admits they used to wave their hand to trace tones—yes, like “conducting an invisible orchestra”—and warns the app should handle accents and regional quirks, not just textbook-perfect Mandarin.

There’s drama, too. The post name-drops Rich Sutton’s famous “bitter lesson” (the idea that more data and computation beat hand-tuned tricks), sparking a meta debate: if that’s true, where do small, on-device models fit? One commenter wonders whether the originator of the idea even believes today’s giant chatbots are the right path. Meanwhile, a cynic frets this tool might “get too big and start sucking like everything,” capturing the community’s ongoing fear that beloved indie projects sell out or bloat.

Tech aside, people love that it tells you exactly where your tone goes wrong, like a brutally honest coach—no jargon, just feedback. The vibe: applause, memes, feature requests, and a little side-eye at scale.

Key Points

  • A 9M-parameter, on-device CAPT (computer-aided pronunciation training) model was trained on ~300 hours of transcribed speech to grade Mandarin pronunciation.
  • The architecture uses a Conformer encoder with CTC loss to capture local spectral details and global tonal context without autocorrecting.
  • Initial pitch visualization via FFT and heuristics (inspired by Praat) proved brittle, prompting a shift to learned representations.
  • Forced alignment is achieved using the Viterbi algorithm to align frame-level CTC probabilities to token sequences in time.
  • Outputs are tokenized as Pinyin with tones as first-class tokens, rather than Hanzi, to expose tone and pronunciation errors.
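The “brittle FFT heuristics” point refers to classic pitch tracking of the kind Praat popularized. A minimal sketch of one such heuristic, autocorrelation-based F0 estimation computed via the FFT, shows both why it is appealing and where it gets fragile (window sizes, lag ranges, and octave errors are all hand-tuned). The function name and parameters below are illustrative assumptions, not the author’s actual code:

```python
import numpy as np

def estimate_pitch(frame, sr=16000, fmin=80.0, fmax=400.0):
    """Estimate F0 of one audio frame via autocorrelation, a common
    Praat-style heuristic (illustrative sketch, not the project's code)."""
    frame = frame - frame.mean()
    # Autocorrelation via the Wiener-Khinchin theorem:
    # inverse FFT of the power spectrum (zero-padded to avoid wraparound).
    spectrum = np.fft.rfft(frame, n=2 * len(frame))
    autocorr = np.fft.irfft(np.abs(spectrum) ** 2)[: len(frame)]
    # Search only lags corresponding to plausible speech pitch;
    # these hand-picked bounds are exactly the kind of brittleness cited.
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    lag = lag_min + np.argmax(autocorr[lag_min:lag_max])
    return sr / lag

# A clean synthetic 200 Hz tone is recovered almost exactly; noisy,
# creaky, or tone-sandhi-laden real speech is where this breaks down.
t = np.arange(0, 0.04, 1 / 16000)
f0 = estimate_pitch(np.sin(2 * np.pi * 200 * t))
```

On clean input this works well, which is why it is tempting; the post’s pivot to learned representations reflects how quickly such heuristics degrade on real learner audio.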
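The forced-alignment bullet describes the standard CTC trick: interleave blanks with the target tokens and run a Viterbi pass over the frame-level log-probabilities to get per-frame timings. The sketch below, with an assumed blank index and toy probabilities (not the author’s implementation), shows the core dynamic program:

```python
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi-align frame-level CTC log-probs (shape [T, vocab]) to a
    target token sequence; returns the best token label per frame.
    Illustrative sketch -- blank index and shapes are assumptions."""
    T = log_probs.shape[0]
    # Extended CTC state sequence: blank between and around every token.
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]                      # stay
            if s >= 1:
                cands.append(dp[t - 1, s - 1])          # advance one state
            # Skip the blank only between two *different* non-blank tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            back[t, s] = best                           # how far we jumped
    # The path must end in the final blank or the final token.
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s -= back[t, s]
        path.append(s)
    path.reverse()
    return [ext[s] for s in path]

# Toy example: 4 frames, vocab {0: blank, 1, 2}, target tokens [1, 2].
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
alignment = ctc_forced_align(np.log(probs), [1, 2])
```

The per-frame path is what lets the tool point at exactly *which* syllable’s tone went wrong: each target token gets a start and end frame, and its pitch contour can be graded in isolation.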

Hottest takes

"conducting an invisible orchestra" — vunderba
"Farsi is a VERY overlooked language" — jellojello
"worried this might get too big and start sucking like everything" — drekipus
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.