March 27, 2026

Seven Percent, 337 Models, Infinite Drama

Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)

Fans cheer accuracy, skeptics see sticker shock — is 7% worth an AI dogpile?

TLDR: Sup AI claims the top score on a tough multi‑subject test by stitching together 337 models and cross‑checking answers, hitting 52.15%. Commenters split: fans want fewer wrong answers at work, while skeptics question latency, cost, and whether a 7‑point gain and self‑run results justify the hype.

Sup AI stormed into Show HN boasting a 337‑model “super‑team” that cross‑checks itself, claims fewer made‑up facts, and just nabbed 52.15% on the brutal Humanity’s Last Exam. Cue the comments: engineers immediately worried about lag and bills, with one grilling the team on how this many models won’t turn your wait time into a loading screen from 2004. Others wanted receipts beyond one test, asking for coding benchmarks and real‑world comparisons, not just a flex on a lab exam and a self‑run white paper.

Still, pragmatists who deal with wrong answers at work are intrigued. One enterprise user said they'll try Sup AI at the office because the promise of fewer bad suggestions sounds like relief from endless “hallucination” debugging. Another commenter calmly explained “entropy” in normal‑people terms, translating the math‑speak about token probabilities into “how sure the model is” — a mini crash course mid‑thread.
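That mid‑thread crash course boils down to a few lines of math. As a rough sketch (not Sup AI's actual scoring code), the entropy of one token's probability distribution measures how spread out the model's bets are — near zero when the model is sure, near log(N) when it has no idea:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution.
    Low entropy = probability mass piled on few tokens = "the model is sure"."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Confident prediction: nearly all mass on a single token.
confident = [0.97, 0.01, 0.01, 0.01]
# Uncertain prediction: mass spread evenly across four tokens.
uncertain = [0.25, 0.25, 0.25, 0.25]

print(token_entropy(confident))   # low
print(token_entropy(uncertain))   # maximal for 4 options: log(4) ≈ 1.386
```

Averaging this over an answer's tokens gives one plausible "how sure is the model" number of the kind the commenter was describing.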

The drama centers on trade‑offs: is a 7‑point bump worth the cost and complexity of wrangling hundreds of models and their web searches? Sup’s bold line — “the only AI for when you need to be right” — made skeptics bristle, especially with the fine print that the benchmark was independently run by Sup, not an outside lab. The vibes? Part awe, part side‑eye, and a whole lot of “prove it… fast and cheap.”

Key Points

  • Sup AI reports 52.15% accuracy on the HLE benchmark using a 337-model confidence-weighted ensemble.
  • The ensemble outperforms the best individual model (~45%) by 7+ percentage points with p<0.001 significance on a 1,369-question sample.
  • All models were evaluated with custom prompts and web search only (no code, calculators, or other tools).
  • Sup AI uses token-level logprob confidence scoring, chunk-based verification, and real-time disagreement detection with targeted retries.
  • An ensemble retrieval approach (keyword, embedding, visual) and support for 10 GB uploads with source transparency are part of the system.

Hottest takes

"how are you handling the orchestration layer overhead when one provider (e.g., Vertex or Bedrock) spikes in P99 latency?" — algolint
"I'll give Sup AI at try over the next few days at work." — hello12343214
"Is 7 extra percent on HLE benchmark really worth the cost of running an entire ensemble of models?" — wavemode
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.