DatBench: Discriminative, faithful, and efficient VLM evaluations

DatBench exposes broken AI report cards; internet wants essay exams

TLDR: DatBench launches a cleaned, faster way to test vision–language AI, revealing big performance drops once guesswork is removed. Commenters celebrate the takedown of “scantron AI,” while skeptics warn essay-style grading could be subjective and gamed—setting off a truth-versus-speed showdown.

The paper behind DatBench lit up the comment sections by basically saying: our AI report cards are trash. Vision-language models (VLMs—systems that read images and text together) are being graded with multiple-choice tests that reward guessing, and up to 70% of questions can be answered without even looking at the image. Add in mislabeled samples (as high as 42%), and you’ve got a class where cheating is easy and grades don’t mean much. The authors flip those tests into open-ended, generative tasks and—boom—capability drops of up to 35%. They also clean and curate benchmarks and claim 13x average speedups (up to 50x) while keeping evaluations honest. Link for the curious: arXiv: DatBench.
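
For the curious, here is roughly what the "answers with its eyes closed" check boils down to: run the same multiple-choice questions with the image withheld and count how many a model still nails. This is a minimal sketch, not the paper's code; `MCQItem`, `ask`, and the letter-matching are stand-in names kept deliberately simple.

```python
# Minimal sketch (assumed names, not DatBench's actual code) of a
# "blind solvability" probe: ask the same multiple-choice questions
# with no image attached and count how many the model still gets right.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MCQItem:
    question: str
    choices: list[str]   # e.g. ["A) cat", "B) dog", "C) fox", "D) owl"]
    answer: str          # gold letter, e.g. "A"
    image_path: str

def blind_solvability(
    items: list[MCQItem],
    ask: Callable[[str, list[str], Optional[str]], str],
) -> float:
    """Fraction of items the model answers correctly with the image withheld.

    `ask(question, choices, image_path)` is a hypothetical wrapper around
    whatever VLM client you use; it should return the model's chosen letter.
    """
    solved_blind = 0
    for item in items:
        pred = ask(item.question, item.choices, None)  # image withheld: text-only
        if pred.strip().upper().startswith(item.answer.upper()):
            solved_blind += 1
    return solved_blind / max(len(items), 1)

# If this number creeps toward the paper's reported 70%, the benchmark is
# mostly rewarding text priors and option elimination, not visual understanding.
```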

Cue the drama. The top comments call this a “scantron culture reckoning,” accusing AI companies of teaching to the test. Others clap back: essay-style answers sound fair until you realize grading them can be subjective and ripe for gaming. One camp cheers the compute savings—“stop burning 20% of GPUs on scorekeeping”—while skeptics ask who audits the graders and whether DatBench just creates a new leaderboard to overfit. Memes arrived fast: “AI took the test with its eyes closed,” “Scantron is canceled,” and a chorus of “DatBench? More like That Bench.” Even the speed claim sparked snark: “13x faster at exposing fluff.”

Key Points

  • Three desiderata for VLM evaluation are proposed: faithfulness, discriminability, and efficiency.
  • Failure modes include multiple-choice formats that reward guessing, blindly solvable questions (up to 70%), and mislabeled/ambiguous samples (up to 42%).
  • Evaluation compute is substantial, with reports of nearly 20% of development compute devoted to evaluation.
  • Converting multiple-choice to generative tasks reveals capability drops of up to 35% (a rough sketch of the conversion follows this list).
  • DatBench-Full (33 datasets across nine capabilities) and DatBench (13x average speedup, up to 50x) are released to improve evaluation rigor and efficiency.
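
For anyone wondering what "flip multiple choice into essays" means in practice, here is a rough sketch under simple assumptions: hide the options, ask for a free-form answer, and grade with a lenient normalized match. The helper names and the substring scoring are illustrative, not DatBench's released pipeline.

```python
# Rough sketch (assumptions, not the released pipeline) of turning a
# multiple-choice item into an open-ended one and scoring the free-form
# answer. Chance accuracy falls from 1/len(choices) to roughly zero,
# which is why scores drop once guessing no longer helps.

import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for lenient matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def to_generative_prompt(question: str) -> str:
    """Hide the answer options entirely; the model has to produce the answer itself."""
    return f"{question}\nAnswer with a short phrase."

def score_generative(prediction: str, gold_answer_text: str) -> bool:
    """Credit the prediction if the gold answer appears in it after normalization."""
    return normalize(gold_answer_text) in normalize(prediction)

# "What animal is on the mat? A) cat B) dog C) fox" with gold "cat" becomes an
# open question, so a free-form "a small cat" scores True while a lucky "A" does not.
print(score_generative("a small cat", "cat"))  # True
print(score_generative("A", "cat"))            # False
```

Substring matching like this is crude (it would happily credit "catalog" for "cat"); real open-ended grading tends to lean on stricter matching or judge models, which is exactly the subjectivity the skeptics in the comments are poking at.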

Hottest takes

"Stop letting AI guess letters like it’s a Scantron" — @grumpygrad
"Turning multiple choice into essays just moves the goalposts" — @metrics_maniac
"13x speedup? Cool—are we 13x closer to honesty?" — @open_sourcerer

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.