April 29, 2026

Booked, busted, and badly quoted

Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs

AI caught reciting whole books, and the comments instantly turned into a copyright cage match

TLDR: Researchers found that tuning an AI model can make it repeat copyrighted books almost exactly, raising big legal and ethical questions. In the comments, people split between lawsuit predictions, anti-copyright rants, and gleeful jokes about bots accidentally becoming Tolkien fanfic machines.

A new research paper dropped a genuinely awkward finding for the artificial intelligence world: when large language models are fine-tuned, they can suddenly start spitting back long chunks of copyrighted books almost word for word. In plain English, a model that seemed restrained can be nudged into becoming that one guy at the party who somehow knows entire novels by heart. The researchers were careful enough that they declined to publish the full examples, because the generated text itself contained large passages from copyrighted books.

But the real fireworks were in the community reaction. One camp basically yelled, “This is the real copyright bomb” and predicted a Napster-style legal meltdown, with one commenter warning that sooner or later someone will get sued for redistributing AI-generated text that turns out to be stolen. Another crowd used the moment to revive the internet’s favorite never-ending war: copyright terms are way too long. That take came in hot, with people arguing that everything from The Lord of the Rings to Harry Potter should already be public domain by now.

Then came the chaos goblins. One commenter casually admitted the news made them want to upload better scans to shadow libraries so future chatbots can answer obscure academic questions. And for comic relief, another user baited Claude with “In a hole in the ground there lived a…” and got a suspiciously perfect Tolkien-style answer. The vibe was part legal panic, part pirate energy, part nerd comedy — and absolutely nobody was calm.

Key Points

  • The repository accompanies the paper *Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models* and includes preprocessing, finetuning, evaluation, and analysis code.
  • Only partial example files are distributed because the full books are copyrighted and the model generations contain large portions of verbatim text.
  • The preprocessing pipeline converts EPUB books into 300–500-word excerpt chunks, re-segments long excerpts with GPT-4o, merges short chunks, and generates plot summaries for finetuning prompts.
  • The finetuning instruction format asks a model to write an excerpt of a specified length while emulating the style and voice of a named author using a plot summary as content.
  • The repository supports finetuning and generation workflows for GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1, and samples 100 completions per excerpt at temperature 1.0.
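The finetuning instruction format described above can be sketched as a small prompt builder. This is a minimal illustration, not the repository's actual code: the function name, the exact prompt wording, and the sampling constants' names are assumptions; only the shape (target length, named author, plot summary as content, 100 samples at temperature 1.0) comes from the Key Points.

```python
# Hypothetical sketch of the finetuning prompt format described in the paper's
# repo: an excerpt of a specified length, emulating a named author's style,
# with a plot summary supplying the content. Wording here is illustrative.

NUM_SAMPLES = 100      # completions sampled per excerpt (per the repo)
TEMPERATURE = 1.0      # sampling temperature (per the repo)


def build_finetune_prompt(author: str, word_count: int, plot_summary: str) -> str:
    """Build an instruction asking the model to write a styled excerpt.

    The three ingredients match the described format: a target length,
    an author whose voice to emulate, and a plot summary as content.
    """
    return (
        f"Write an excerpt of approximately {word_count} words, emulating "
        f"the style and voice of {author}. Base the excerpt on the "
        f"following plot summary:\n\n{plot_summary}"
    )


if __name__ == "__main__":
    prompt = build_finetune_prompt(
        author="J.R.R. Tolkien",
        word_count=400,
        plot_summary="A hobbit living a quiet life receives unexpected visitors.",
    )
    print(prompt)
```

In the evaluation loop, a prompt like this would be sent to each model (GPT-4o, Gemini-2.5-Pro, DeepSeek-V3.1) `NUM_SAMPLES` times at `TEMPERATURE`, and the completions compared against the source text for verbatim overlap.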

Hottest takes

"The idea that I could eventually ask ChatGPT..." — TFNA
"The Lord of the Rings should be in the public domain" — x-complexity
"the industry will face a Napster-style reckoning" — rectang
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.