April 29, 2026
JSON passes, trust crashes
Show HN: A new benchmark for testing LLMs for deterministic outputs
New AI scorecard drops, and the comments instantly cry foul: hype, snubs, and nonsense
TL;DR: A new AI benchmark tries to measure whether chatbots can extract facts correctly, not just spit out neat-looking data. Commenters immediately split between "finally, this measures the real problem" and "nice chart, but where are the missing top models, and does this matter if scores are already high?"
A new benchmark called SOB just entered the chat, promising to test whether chatbots can turn messy real-world stuff—like invoices, medical notes, and transcripts—into clean, reliable data without quietly making things up. The creators say the old way of grading AI was too easy: it mostly checked whether the answer looked like valid computer-readable text, not whether the actual facts inside were right. In other words, the AI might hand in a neat form that is beautifully wrong—and that’s exactly what breaks real businesses.
But the real fireworks were in the comments. One camp basically said, cool benchmark, but this still feels fragile in the real world. A top reaction argued that asking one model to both understand the source and format it perfectly in one shot is like asking someone to cook dinner and file your taxes at the same time. Another commenter went straight for the leaderboard drama: if this is a serious ranking, why are some big-name models missing? That complaint gave the thread immediate "award show snub" energy.
Then came the skeptics: if most models already score high, does this benchmark actually tell us anything new? Others pushed back, saying yes—because the real danger isn’t the obvious crash, it’s when the AI gives you a valid-looking answer that’s subtly false and nobody notices until production explodes. The vibe was part lab debate, part comment-section food fight, with a strong undercurrent of "numbers are nice, but can I trust this thing with my invoices?"
Key Points
- The article introduces SOB as a benchmark for testing LLMs on deterministic structured-output tasks across three modalities.
- It argues that schema compliance alone is insufficient because production systems depend on correct values, not just parseable JSON.
- Each benchmark record includes a JSON Schema and a human-verified ground-truth answer, with image and audio inputs converted to text-normalized context before scoring.
- SOB uses seven metrics and two scoring gates to distinguish structural validity from actual value correctness and to prevent inflated scores.
- The reported results for 21 models show structural metrics clustering near the ceiling, while Value Accuracy and Perfect Response Rate provide stronger separation between models.
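The failure mode the thread keeps circling back to can be shown in a few lines: a model reply that parses as valid JSON and matches the expected shape, yet carries a wrong value. The snippet below is a minimal illustrative sketch, not SOB's actual scoring code; the invoice field names and numbers are invented for the example.

```python
import json

# Hypothetical ground truth for an invoice-extraction task
# (field names and values are invented, not from the SOB benchmark).
ground_truth = {"invoice_id": "INV-1001", "total": 1250.00}

# A model reply that looks perfectly well-formed...
model_reply = '{"invoice_id": "INV-1001", "total": 125.00}'

parsed = json.loads(model_reply)  # structural gate: valid JSON, no exception

# Schema-style gate: right keys, right types -- this passes.
schema_ok = (
    isinstance(parsed.get("invoice_id"), str)
    and isinstance(parsed.get("total"), (int, float))
)

# Value gate: the total is off by a factor of ten -- this fails.
value_ok = parsed == ground_truth

print(schema_ok, value_ok)  # True False
```

A grader that stops at the first two checks reports success; only the value comparison catches the "beautifully wrong" answer the article warns about.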