December 4, 2025
No ifs, ands, or butts
The End of the Train-Test Split
From butt rules to AI chaos: commenters say fix the data, not the hype
TLDR: An old “butt detector” now challenged by complex policy rules sparks a claim that train–test splits don’t work for big AI tasks. Commenters fire back: fix the labels, use embeddings, and stop the hype—this is the same old machine learning problem with messier rules and not enough experts.
An 8-year-old “butt detector” turned policy epic has the internet cackling and arguing. The author claims the old-school train–test split — the classic way to check whether a model works — breaks down once you add a 10-page adult-content rulebook and a chatty AI. Cue community drama: one camp, led by elpakal, says the real villain is bad labels — spend time with policy experts, not bigger models. Another camp, typified by roadside_picnic, is baffled by the “weird direct calls to the LLM”, insisting that smarter combinations of text and image embeddings would beat prompt spaghetti.
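The embeddings-over-prompts approach that roadside_picnic alludes to can be sketched in a few lines: encode each item with a text encoder and an image encoder, fuse the vectors, and fit a plain classifier on top. Everything below is illustrative, not the article's system: synthetic random vectors stand in for real encoder outputs (e.g. CLIP-style towers), and concatenation is only one of several fusion choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fake_embed(n, dim, shift):
    # Stand-in for a real encoder call (e.g. a CLIP image or text tower).
    return rng.normal(loc=shift, scale=1.0, size=(n, dim))

# 200 "violating" and 200 "benign" items, each with an image and a text embedding.
img_pos, txt_pos = fake_embed(200, 64, 0.5), fake_embed(200, 32, 0.5)
img_neg, txt_neg = fake_embed(200, 64, -0.5), fake_embed(200, 32, -0.5)

# Fuse the two modalities by simple concatenation (averaging or learned
# fusion layers are common alternatives).
X = np.vstack([np.hstack([img_pos, txt_pos]), np.hstack([img_neg, txt_neg])])
y = np.array([1] * 200 + [0] * 200)

# A plain linear classifier on frozen embeddings, instead of per-item LLM calls.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

The design point commenters make is that the expensive model runs once per item to produce embeddings, while the cheap, auditable classifier on top is what encodes the policy.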
Then come the skeptics: stephantul isn’t buying a revolution, calling it the same old ML grind dressed up in LLM buzzwords. And the memes? Henning’s dry “Two months later, you’ve cracked it” got a chorus of “Hehe,” while commenters turned country-specific crack limits into “Cleft-o-meters” and “ButtBot bingo.”
Under the jokes, there’s a serious split: Is the solution to ditch training and test on “blind” data, or to admit you need better guidance, clearer rules, and expert time? The crowd’s verdict: you can’t debug gray-area policies with math alone, and no amount of AI flair fixes messy, human-made rules without humans in the loop.
Key Points
- A 2015 CNN-based image classifier at Facebook achieved 92% precision and 98% recall under region-specific content rules.
- In 2023, a policy-driven, context-aware LLM decision tree stalled at ~85% precision/recall; a CNN retrained on the new labels scored ~83%.
- A policy review found label-quality issues: ~60% of labels correct, ~20% incorrect, and ~20% edge cases not covered by the policy.
- The article argues train–test splits are unsuitable for complex LLM classification tasks that require scarce expert labels and nuanced policies.
- It recommends not training LLMs on the dataset, evaluating instead on blind, unlabeled data, and emphasizes clearer rules and close policy–engineering collaboration.
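For context, the baseline workflow the article calls unsuitable, a train–test split scored by precision and recall, looks roughly like this. All data and labels below are synthetic: a noisy linear rule stands in for human policy labels, and none of the numbers correspond to the article's systems.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 16))
# Hypothetical labels: a noisy linear rule stands in for expert policy labels.
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

# The classic split: hold out 20% of labeled data, never train on it.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Precision: of the items flagged, how many were truly violations?
# Recall: of the true violations, how many were flagged?
print(f"precision={precision_score(y_te, pred):.2f} "
      f"recall={recall_score(y_te, pred):.2f}")
```

The article's complaint is that this loop silently assumes the labels are right; with ~40% of labels wrong or uncovered by policy, the held-out score measures agreement with noise, which is why it recommends blind evaluation with experts instead.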