January 6, 2026
Scantron is canceled
DatBench: Discriminative, faithful, and efficient VLM evaluations
DatBench exposes broken AI report cards; internet wants essay exams
TLDR: DatBench launches a cleaned, faster way to test vision–language AI, revealing big performance drops once guesswork is removed. Commenters celebrate the takedown of “scantron AI,” while skeptics warn essay-style grading could be subjective and gamed—setting off a truth-versus-speed showdown.
The paper behind DatBench lit up the comment sections by basically saying: our AI report cards are trash. Vision-language models (VLMs—systems that read images and text together) are being graded with multiple-choice tests that reward guessing, and up to 70% of questions can be answered without even looking at the image. Add in mislabeled samples (as high as 42%), and you’ve got a class where cheating is easy and grades don’t mean much. The authors flip those tests into open-ended, generative tasks and—boom—capability drops of up to 35%. They also clean and curate benchmarks and claim 13x average speedups (up to 50x) while keeping evaluations honest. Link for the curious: arXiv: DatBench.
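The "answer with your eyes closed" problem is easy to probe: hide the image and see how well a text-only guesser scores on the multiple-choice options. Below is a minimal, hypothetical sketch of such a blind baseline; the item fields and the `text_only_model` callable are illustrative assumptions, not DatBench's actual interface.

```python
# Hypothetical sketch: estimate what fraction of a VQA-style benchmark is
# "blindly solvable" by hiding the image and letting a text-only guesser
# pick among the multiple-choice options. Field names and the
# text_only_model callable are assumptions, not DatBench's API.

import random

def blind_accuracy(items, text_only_model=None, seed=0):
    """Answer each question WITHOUT the image.

    items: list of dicts like {"question": str, "choices": [str], "answer": str}
    text_only_model: callable(question, choices) -> predicted choice; if None,
                     fall back to uniform random guessing as a floor.
    """
    rng = random.Random(seed)
    correct = 0
    for item in items:
        if text_only_model is not None:
            pred = text_only_model(item["question"], item["choices"])
        else:
            pred = rng.choice(item["choices"])  # pure guesswork baseline
        correct += int(pred == item["answer"])
    return correct / len(items)

if __name__ == "__main__":
    toy = [
        {"question": "What color is the stop sign?", "choices": ["red", "blue"], "answer": "red"},
        {"question": "How many dogs are shown?", "choices": ["1", "2", "3", "4"], "answer": "2"},
    ]
    # A high score here means the benchmark leaks answers through text alone.
    print(f"blind accuracy: {blind_accuracy(toy):.2f}")
```

A score far above chance on a check like this is exactly the leakage the paper is calling out: the "vision" benchmark is gradeable without vision.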
Cue the drama. The top comments call this a “scantron culture reckoning,” accusing AI companies of teaching to the test. Others clap back: essay-style answers sound fair until you realize grading them can be subjective and ripe for gaming. One camp cheers the compute savings—“stop burning 20% of GPUs on scorekeeping”—while skeptics ask who audits the graders and whether DatBench just creates a new leaderboard to overfit. Memes arrived fast: “AI took the test with its eyes closed,” “Scantron is canceled,” and a chorus of “DatBench? More like That Bench.” Even the speed claim sparked snark: “13x faster at exposing fluff.”
Key Points
- Three desiderata for VLM evaluation are proposed: faithfulness, discriminability, and efficiency.
- Failure modes include multiple-choice formats that reward guessing, blindly solvable questions (up to 70%), and mislabeled or ambiguous samples (up to 42%).
- Evaluation is expensive: reports put nearly 20% of development compute toward it.
- Converting multiple-choice questions to generative tasks reveals capability drops of up to 35% (a toy sketch of this conversion follows the list).
- DatBench-Full (33 datasets across nine capabilities) and DatBench (13x average speedup, up to 50x) are released to improve evaluation rigor and efficiency.
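To illustrate the multiple-choice-to-generative flip described above, here is a hedged sketch under simple assumptions: drop the answer options from the prompt and grade the model's free-form reply with a normalized substring match. The `vlm_generate` callable and the grading rule are placeholders for illustration, not the paper's protocol.

```python
# Hypothetical sketch of the multiple-choice -> generative conversion idea.
# vlm_generate and the grading rule below are illustrative assumptions.

import re

def to_generative_prompt(item):
    """Strip the answer options so the model must produce the answer itself."""
    return f"Question: {item['question']}\nAnswer briefly."

def normalize(text):
    """Lowercase and drop punctuation so 'Red.' matches 'red'."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def generative_score(items, vlm_generate):
    """vlm_generate: callable(image, prompt) -> str (an assumed interface)."""
    hits = 0
    for item in items:
        reply = vlm_generate(item["image"], to_generative_prompt(item))
        hits += int(normalize(item["answer"]) in normalize(reply))
    return hits / len(items)
```

Even this crude grader removes the guessing bonus, which is the paper's point; the comment-section fight is over whether the grading itself can be kept objective.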