November 5, 2025
Order in the comment court!
Show HN: The Massive Legal Embedding Benchmark (MLEB)
Startup drops a 'legal IQ test' and—surprise—aces it, igniting comment court
TLDR: A startup launched MLEB, a big test for AI that reads legal text, and says its model scores highest and fastest. The comments applaud better legal benchmarks but roast vendor-run testing, challenge the dismissal of non-English data, and demand independent, global evaluations.
On Show HN, a startup unveiled the Massive Legal Embedding Benchmark (MLEB)—a giant test for AI that reads legal text—and declared its own Kanon 2 model both top scorer and fastest. Embeddings are just tools that turn words into numbers so search actually finds the right law.
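For readers who want to see what that means in practice, here is a minimal sketch of embedding-based legal search using the open sentence-transformers library. The model name and the example texts are illustrative stand-ins; this does not use Kanon 2 or any MLEB dataset.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative general-purpose model, not the legal-specialised models MLEB compares.
model = SentenceTransformer("all-MiniLM-L6-v2")

statutes = [
    "A landlord must return the security deposit within 30 days of lease termination.",
    "Copyright in a work subsists from the moment of its fixation in a tangible medium.",
    "An employer shall provide written notice before terminating an employee without cause.",
]
query = "When does my landlord have to give back my deposit?"

# Embeddings map each text to a vector; texts with similar meaning land near each other.
doc_vecs = model.encode(statutes, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(f"Best match (score {scores[best]:.2f}): {statutes[best]}")
```

Retrieval benchmarks like MLEB essentially run many such queries against labeled legal corpora and score how often the model ranks the genuinely relevant document first.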
The thread exploded. Fans cheered a serious, lawyer-grade test beyond contract snippets, but the hottest take was simple: “You built the exam your model aced.” One commenter dubbed them judge, jury, and benchmarker, while another asked for a neutral referee and public raw data. The post’s takedown of older tests—saying they were mislabeled, too US‑centric, and built by automation—lit another fuse. Researchers from India pushed back on the AILA critique, calling the “automated” jab a cheap shot without fresh evidence.
The spiciest fight? The claim that non‑English datasets add “noise.” Global lawyers called that an Anglo bubble, arguing real legal work crosses borders. Meanwhile, speed bragging drew memes: “Fastest to the verdict doesn’t mean correct,” and “Latency is not a law degree.” Jokes flew—“Major Legal Ego Battle,” “objection: relevance,” “hearsay labeling”—but top comments demanded independent benchmarking, clearer methodology, and head‑to‑head tests against general models before crowning a champ. Until then, the verdict is pending.
Key Points
- MLEB is introduced as a large, diverse legal embedding benchmark with 10 datasets across document types, jurisdictions, areas of law, and tasks.
- Kanon 2 Embedder reportedly ranks highest on MLEB while achieving the lowest inference time among commercial competitors.
- LegalBench-RAG is criticized for limited scope (4 datasets) focused solely on contracts and dominated by U.S. content.
- MTEB’s legal split is critiqued for mislabeling (especially AILA Casedocs and AILA Statutes) and insufficient diversity in English datasets.
- The authors designed MLEB to ensure high-quality labeling and provenance, real-world utility, challenging tasks, and broad representation.