LLMs Corrupt Your Documents When You Delegate

Even the smartest bots are quietly wrecking your files, and commenters say “told you so”

TL;DR: Researchers found that handing long document-editing jobs to chatbots often leads to quiet but serious damage, even with the best systems. Commenters weren't shocked; many said this is exactly why you should never let AI touch important files without constant human checking.

The paper’s headline finding is a trust nightmare: ask today’s big-name chatbots to handle a long chain of edits, and by the end they’ve quietly damaged about a quarter of the document on average. That’s not just a typo here or there. Researchers tested these systems across 52 real-world fields, from coding to music notation, and found that the longer you let the bot “take the wheel,” the more likely it is to slip in strange, silent errors. Worse still, giving the models agentic tools didn’t save the day.

But the real fireworks were in the comments, where the mood was basically “we warned you.” One user praised the testing method as unusually solid, saying it was shocking that the errors showed up even on tasks that should have been easy for computers. Others were far less charitable, calling repeated AI editing a kind of content rot that gets worse every pass. One especially cutting phrase stole the show: “semantic ablation” — a fancy way of saying the bot sands off meaning until your document turns bland, broken, or both.

There was some genuine debate, too. A few commenters wanted more detail on where the damage happens, while others argued that this is simply how these tools behave: they make random mistakes at every level, and whether disaster strikes depends on whether a human catches it. The accidental comedy came from people sharing war stories about bots misreading file names, mangling redirects, and slowly turning decent conversions into a mess. Translation: the community is not anti-AI; it is anti-letting AI babysit your important stuff unsupervised.

Key Points

  • The article introduces DELEGATE-52, a benchmark for evaluating LLM performance in long delegated document-editing workflows.
  • DELEGATE-52 covers document editing across 52 professional domains, including coding, crystallography, and music notation.
  • The reported experiment evaluated 19 LLMs and found that current models degrade documents during delegation.
  • Even frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupted an average of 25% of document content by the end of long workflows.
  • The article reports that agentic tool use did not improve performance, and degradation worsened with larger documents, longer interactions, and distractor files.
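Why does degradation get worse with longer workflows? One intuition, sketched below, is that per-pass errors compound. This is only an illustrative model, not the paper's methodology, and the per-pass error rate used here is a made-up number: if each editing pass independently corrupts some small fraction of the still-intact content, the intact fraction shrinks geometrically with the number of passes.

```python
def intact_fraction(p: float, n: int) -> float:
    """Fraction of the document still intact after n editing passes,
    assuming each pass independently corrupts fraction p of what
    remains intact. Purely illustrative; p is hypothetical."""
    return (1 - p) ** n

# Even a modest 3% per-pass error rate compounds noticeably
# over a long delegated workflow:
for n in (1, 5, 10, 20):
    print(f"after {n:2d} passes: {intact_fraction(0.03, n):.1%} intact")
```

Under this toy model, a 3% per-pass error rate leaves only about 74% of the document intact after ten passes, which is at least consistent in spirit with the "compounding with each pass" complaint in the comments below.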

Hottest takes

“AI-washing any text will degrade it, compounding with each pass.” — causal
“‘Semantic ablation’ is my favorite term for it” — causal
“I have yet to find a model that does not make mistakes on every turn.” — adampunk
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.