Why removing 'um' from a recording is harder than it sounds

Turns out deleting awkward “ums” is a nightmare — and commenters are obsessed

TLDR: A developer built a tool to remove “um” and “uh” from recordings, but the big surprise is how messy that job gets when bad cuts make audio sound worse. Commenters were fascinated, joking about Jurassic Park, asking for video editing features, and laughing at the irony of AI adding filler words just so other AI can delete them.

What should have been a simple “delete the awkward bits” job has turned into a full-blown audio editing soap opera. The developer behind a tool called erm discovered that cleaning up spoken recordings isn’t just about spotting every “um” and snipping it out. If you cut in the wrong place, you get ugly clicks, weird jumps in background noise, and speech that sounds even more awkward than the original. In other words: the internet’s dream of one-click confidence is officially delayed.

And the comment section? Absolutely loving the chaos. One reader summed up the mood perfectly: this isn’t “find and replace” for your voice. Another instantly went full meme mode, announcing plans to test it on “a certain clip from Jurassic Park,” which is exactly the kind of unhinged energy this story deserves. Others saw practical gold here, with one commenter begging for video editor support so creators could scrub through every cringe pause and decide which verbal hiccups deserve mercy.

The funniest hot take came from the AI irony crowd: first tech companies taught computers to add fake human filler words, and now another tool is removing them. As one commenter basically put it, AI is now creating the problem and selling the cure. There wasn’t much outright fighting, but there was plenty of amused disbelief, nerdy admiration, and that very internet-specific mix of “this is brilliant” and “wow, humans really built software for every possible insecurity.”

Key Points

  • The article presents erm, a tool for removing spoken filler words such as “um” and “uh” from audio recordings.
  • A simple approach based only on Whisper transcripts and timestamp cuts fails for three reasons: Whisper often leaves fillers out of its transcript, cuts that land mid-waveform can click, and the background noise level can jump across an edit.
  • erm uses OpenAI’s Whisper via faster-whisper, requesting word-level timestamps and prompting the model not to clean up the transcript.
  • Beyond transcript matching, erm adds three audio-based detection passes for fillers in long pauses, fillers merged into adjacent words, and suspiciously long word tails.
  • The article says high-quality filler removal depends on refining cut points against the actual audio rather than slicing at raw timestamps.
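
To make the last point concrete, here is a minimal sketch of "refine the cut, don't trust the timestamp." It is not erm's actual code. It assumes word timestamps already exist as plain (text, start, end) tuples, nudges each cut to a nearby zero crossing, and blends a short crossfade across the join. The filler list, the function names, and the 5 ms defaults are all made up for illustration.

```python
import math

FILLERS = {"um", "uh", "erm"}  # assumed list for illustration only


def filler_spans(words):
    """Return (start_sec, end_sec) for every word that is a filler."""
    return [(start, end) for text, start, end in words
            if text.strip().lower().strip(".,!?") in FILLERS]


def nearest_zero_crossing(samples, index, radius):
    """Closest sign change within `radius` samples of `index`, else `index`."""
    lo = max(1, index - radius)
    hi = min(len(samples) - 1, index + radius)
    best = None
    for i in range(lo, hi + 1):
        rising = samples[i - 1] <= 0.0 < samples[i]
        falling = samples[i - 1] >= 0.0 > samples[i]
        if (rising or falling) and (best is None or abs(i - index) < abs(best - index)):
            best = i
    return index if best is None else best


def cut_with_crossfade(samples, start, end, fade):
    """Drop samples[start:end] and overlap `fade` samples across the join."""
    head, tail = samples[:start], samples[end:]
    fade = min(fade, len(head), len(tail))
    if fade == 0:
        return head + tail  # plain hard cut
    blended = []
    for k in range(fade):
        w = (k + 1) / (fade + 1)  # linear ramp from head to tail
        blended.append(head[len(head) - fade + k] * (1.0 - w) + tail[k] * w)
    return head[:len(head) - fade] + blended + tail[fade:]


def remove_fillers(samples, rate, words, radius_ms=5, fade_ms=5):
    """Cut every filler span, refining each edge against the audio itself."""
    radius = int(rate * radius_ms / 1000)
    fade = int(rate * fade_ms / 1000)
    out = list(samples)
    # Work from the last span to the first so earlier indices stay valid.
    for start_s, end_s in sorted(filler_spans(words), reverse=True):
        a = nearest_zero_crossing(out, int(start_s * rate), radius)
        b = nearest_zero_crossing(out, int(end_s * rate), radius)
        if b > a:
            out = cut_with_crossfade(out, a, b, fade)
    return out


# Toy input: three seconds of a 50 Hz tone with an "um" in the middle second.
rate = 1000
tone = [math.sin(2 * math.pi * 50 * i / rate) for i in range(3 * rate)]
words = [("hello", 0.0, 1.0), ("um", 1.0, 2.0), ("world", 2.0, 3.0)]
cleaned = remove_fillers(tone, rate, words)
print(len(tone), len(cleaned))
```

The crossfade costs a few milliseconds of audio on each join, which is the trade for avoiding a hard edge. A real tool would also have to match the noise floor on both sides of the cut, which this sketch ignores.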

Hottest takes

"not a find and replace type thing" — dougcalobrisi
"I’m going to try this on a certain clip from Jurassic Park" — rindalir
"AI puts it in, AI helps take it out" — cryptoz