Lambda Calculus Benchmark for AI

Leaderboard shocker as fans question the rules and roast the results

TLDR: A 120-puzzle AI benchmark crowns GPT-5.4 by a hair, while the crowd is baffled that GPT-5.5 scores lower. Commenters battle over one-shot scoring versus multi-run fairness and note that the top models are basically tied. That matters because benchmarks steer hype, research focus, and big spending decisions.

A new brainteaser-style test for AIs just dropped, and the charts have everyone arguing. LamBench puts 21 models through 120 tiny logic puzzles drawn from lambda calculus, a bare-bones formal language where everything is built out of functions (think: ultra-minimal code riddles). The scoreboard? GPT-5.4 edges out the pack with 110/120, Opus 4.6 and friends are right behind, and the internet immediately lit up. One stunned watcher summed it up: why is GPT-5.5 trailing 5.4? That odd twist became instant meme fuel.
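If "lambda calculus" means nothing to you, here's a toy illustration in Python of the kind of riddle it produces. This is our own sketch, not a task from the actual LamBench suite: Church numerals encode numbers as nothing but functions, and the puzzle is predicting what a term evaluates to.

```python
# Toy lambda-calculus puzzle in Python (illustrative only, not from LamBench).
# Church numerals represent the number n as "apply f to x, n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))               # n + 1
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))  # m + n

def to_int(n):
    """Decode a Church numeral back to a plain int, for checking answers."""
    return n(lambda k: k + 1)(0)

two   = succ(succ(zero))
three = succ(two)

# The riddle: what does add(two)(three) reduce to?
assert to_int(add(two)(three)) == 5
```

Hand a model the raw term without the decoder and ask what it reduces to, and you get the flavor of puzzle the leaderboard is scoring.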

The real fireworks came over the rules. A top comment slammed the setup for scoring each problem on a single attempt, calling that unfair to "probabilistic" chatbots whose answers can vary from try to try. The demand: run each task 5, 15, even 45 times to see the true curve, not a single coin flip. Meanwhile, a hard-nosed crowd shrugged at the drama and said the scoreboard just confirms the obvious: top lab models are neck-and-neck, the rest aren't close, so maybe cool it with the next "Opus killer" hype. Others asked after missing favorites ("Where's Mistral?"), and old-school nerds cracked in-jokes about the creator and his beloved math toys.
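To put numbers on that complaint, here's a minimal sketch of the statistics, using a Wilson score interval (our choice for illustration; there's no indication LamBench computes anything like this). One pass/fail sample per task leaves a confidence band nearly as wide as the whole scale, while dozens of runs narrow it to something you can actually rank on.

```python
import math

def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a per-task pass rate."""
    if runs == 0:
        return (0.0, 1.0)
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# A single run of a task the model truly passes ~70% of the time:
print(wilson_interval(1, 1))    # ~(0.21, 1.00): one coin flip, near-total uncertainty
# The same task run 45 times, passing 32 of them:
print(wilson_interval(32, 45))  # ~(0.57, 0.82): a usable estimate
```

Roughly speaking, single-run noise partly averages out over 120 tasks, but it still amounts to a few points on a total score, so the one- and two-point gaps at the top of the table sit well within it. Both camps have a point: the ranking order is shaky, and the top pack really is a tie.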

If you want the receipts, the GitHub repo has the full task list and live results. In short: new test, tight race, spicy comments, and a community roasting the rules while memeing the math.

Key Points

  • LamBench reports a leaderboard for a Lambda Calculus Benchmark covering 120 tasks and 21 models.
  • Top scores: gpt-5.4 (110/120, 91.7%), opus-4.6 (108/120, 90.0%), gpt-5.3-codex (107/120, 89.2%), opus-4.7 and gemini-3.1-pro-preview (106/120, 88.3%).
  • Mid-table models include sonnet-4.6 (99/120, 82.5%), gpt-5.5 (94/120, 78.3%), gpt-5.2 (93/120, 77.5%), and gpt-5.1 (90/120, 75.0%).
  • Lower scores include google/gemma-4-31b-it (22/120, 18.3%) and gpt-5.3-codex-spark (14/120, 11.7%).
  • A GitHub repository link is provided for LamBench, noting “21 models · 120 tasks.”

Hottest takes

"they are going to need to run each about 45 times." — dataviz1000
"Models from top labs are neck and neck, and the rest of the bunch are nowhere near." — NitpickLawyer
"Odd to see GPT 5.5 behind 5.4?" — cmrdporcupine