The End of the Train-Test Split

From butt rules to AI chaos: commenters say fix the data, not the hype

TLDR: An old “butt detector,” now strained by a complex policy rulebook, sparks the claim that train-test splits don’t work for big AI tasks. Commenters fire back: fix the labels, use embeddings, and stop the hype. This is the same old machine learning problem, just with messier rules and not enough experts.

An 8-year-old “butt detector” turned policy epic has the internet cackling and arguing. The author claims the old-school train-test split, the classic way to check whether a model works, breaks down once you add a 10-page adult-content rulebook and a chatty AI. Cue community drama: one camp, led by elpakal, says the real villain is bad labels, so spend your time with policy experts, not bigger models. Another camp, like roadside_picnic, is baffled by “weird direct calls to the LLM,” insisting that smarter combinations of text and image embeddings would beat prompt spaghetti.
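That embeddings-first pitch is easy to picture in code. Here is a minimal sketch, not anything from the article: the `embed_text` and `embed_image` stubs are hypothetical placeholders for real encoders (a sentence-transformer, a CLIP-style image model), and the toy data exists only so the snippet runs end to end.

```python
# Sketch of the "embeddings + small classifier" approach commenters favored
# over direct LLM calls. The embed_* functions are stand-ins: in practice
# you would plug in a real text encoder and a real image encoder; here they
# return random vectors so the sketch is self-contained and runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def embed_text(texts):          # placeholder for a real text encoder
    return rng.normal(size=(len(texts), 384))

def embed_image(images):        # placeholder for a real image encoder
    return rng.normal(size=(len(images), 512))

# Toy data: paired (caption, image) posts with policy labels (1 = violates).
texts = [f"caption {i}" for i in range(200)]
images = [object()] * 200       # stand-ins for decoded images
labels = rng.integers(0, 2, size=200)

# Concatenate the two modalities into one feature vector per item.
X = np.hstack([embed_text(texts), embed_image(images)])

# The classic train-test split the article says breaks down at this scale.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```

The appeal, per the comments, is the loop: a cheap linear head on frozen embeddings retrains in seconds every time the policy team fixes labels, instead of re-prompting an LLM and hoping.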

Then come the skeptics: stephantul isn’t buying a revolution, calling it the same old ML grind dressed up in LLM buzzwords. And the memes? Henning’s dry “Two months later, you’ve cracked it” got a chorus of “Hehe,” while commenters turned country-specific crack limits into “Cleft-o-meters” and “ButtBot bingo.”

Under the jokes, there’s a serious split: Is the solution to ditch training on the dataset entirely and evaluate on “blind” data, or to admit you need better guidance, clearer rules, and expert time? The crowd’s verdict: you can’t debug gray-area policies with math alone, and no amount of AI flair fixes messy, human-made rules without humans in the loop.

Key Points

  • A 2015 CNN-based image classifier at Facebook achieved 92% precision and 98% recall under region-specific content rules (see the metrics refresher after this list).
  • In 2023, a policy-driven, context-aware LLM decision tree stalled at ~85% precision/recall; a CNN retrained on new labels scored ~83%.
  • Policy review found label quality issues: ~60% correct, ~20% incorrect, ~20% edge cases not covered.
  • The article argues train-test splits are unsuitable for complex LLM classification tasks requiring scarce expert labels and nuanced policies.
  • It recommends not training LLMs on the dataset and using blind, unlabeled evaluation, emphasizing clearer rules and close policy–engineering collaboration.
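For readers squinting at those percentages, a quick refresher on what precision and recall actually measure; the counts below are invented purely to land near the 2015 figures.

```python
# Precision/recall refresher. The tp/fp/fn counts are hypothetical,
# chosen only to reproduce numbers close to the article's 2015 metrics.
def precision(tp: int, fp: int) -> float:
    # Of everything the model flagged, how much was actually a violation?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all true violations, how many did the model catch?
    return tp / (tp + fn)

tp, fp, fn = 98, 9, 2  # hypothetical counts landing near 92% / 98%
print(f"precision ~ {precision(tp, fp):.0%}, recall ~ {recall(tp, fn):.0%}")
```

High recall with somewhat lower precision means few missed violations but more false flags; per the key points above, the 2023 systems slipped on both at once.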

Hottest takes

"the machine learning engineer's priority should be spent working with policy teams to improve the data" — elpakal
"I can never understand why people jump to these weird direct calls to the LLM" — roadside_picnic
"I don’t really believe this is a paradigm shift" — stephantul
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.