Arena AI Model ELO History

New AI score chart promised nerf drama; commenters instantly started fighting about the math

TLDR: A new chart tracks how top AI bots rise and fall over time, claiming it can expose when companies quietly make them worse. Commenters immediately split into camps: some say it reveals real trust issues, while others say the scores only show tougher competition, not secret sabotage.

A new Arena AI Model ELO History chart is trying to do something the internet absolutely loves: catch AI companies in the act of quietly making their chatbots worse. The pitch is juicy. Maybe your favorite bot didn’t suddenly get “dumber”; maybe it was nerfed behind the scenes. But the comments wasted no time turning this from a data project into a full-on nerd brawl over whether the chart can prove that at all.

The biggest fight? A bunch of commenters said the chart’s headline promise is way more dramatic than the evidence supports. Several people argued that these scores are relative: older models can slide simply because newer, stronger rivals showed up, not because anyone secretly sabotaged them. In other words: maybe it’s not decay, maybe the competition just got hotter. One commenter basically called the whole thing “slop,” accusing it of cashing in on the popular “AI got nerfed” vibe while mostly showing lines that go up.
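
To make the “relative scores” argument concrete, here is a toy simulation, a minimal sketch under loud assumptions: it uses plain per-game Elo updates rather than whatever batch fitting LM Arena actually performs, and the skills, K-factor, entry rating, and match schedule are all invented. The narrow point it illustrates: a model whose true ability never changes can still show a decaying rating once stronger newcomers enter at the default score.

```python
import random

K = 16            # Elo update step size (arbitrary choice)
DEFAULT = 1000.0  # rating every new entrant starts with (assumed)

def expected(r_a, r_b):
    """Elo's predicted score for A against B from two ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

random.seed(0)
# One incumbent whose latent skill never changes, rated correctly at first.
models = {"incumbent": {"skill": 1200.0, "rating": 1200.0}}

checkpoints = []
for step in range(30_000):
    if step % 5_000 == 0:  # every 5k battles, a stronger rival joins
        n = step // 5_000
        models[f"rival_{n}"] = {"skill": 1250.0 + 50 * n, "rating": DEFAULT}
    a, b = random.sample(sorted(models), 2)
    # The true outcome is drawn from the fixed latent skills...
    p_a = expected(models[a]["skill"], models[b]["skill"])
    s_a = 1.0 if random.random() < p_a else 0.0
    # ...while the rating update only sees the current ratings.
    delta = K * (s_a - expected(models[a]["rating"], models[b]["rating"]))
    models[a]["rating"] += delta
    models[b]["rating"] -= delta  # zero-sum update
    if step % 5_000 == 4_999:
        checkpoints.append(round(models["incumbent"]["rating"]))

print(checkpoints)  # a sinking curve, despite constant incumbent skill
```

The mechanism is mundane rather than conspiratorial: these updates are zero-sum, so each underrated strong entrant drains points from the incumbents while it climbs toward its true level, and a flat line of ability plots as a downward-sloping curve.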

Then came the correction squad. One user popped in with the delightfully petty fact-check that Elo isn’t an acronym at all; it’s a person’s last name (Arpad Elo, who devised the chess rating system), which is exactly the kind of comment-section energy this story deserves. And in the middle of the suspicion storm, an OpenAI employee jumped in to flatly deny any secret “under high load we swap in worse models” conspiracy, insisting there are no shady time-of-day tricks. So yes, the chart is about AI performance, but the real spectacle is the comment thread: part watchdog debate, part semantics fight, part meme-worthy trust crisis.

Key Points

  • The article presents a chart that tracks the Elo history of each major AI lab’s flagship model lineage over time.
  • It states that post-launch model updates can alter behavior, for example through heavier censorship, quantization, or other forms of degradation.
  • The chart is based on LMSYS Arena API evaluations, which the article distinguishes from consumer web interfaces that may add prompts, filters, or wrappers.
  • Data is fetched daily from the official LM Arena Leaderboard Dataset hosted on Hugging Face and is based on blind, crowdsourced human evaluations.
  • The methodology uses one curve per AI lab, keeps the highest-rated flagship-eligible model active, merges inference-mode variants, and marks new releases as labeled points (see the sketch after this list).
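
To make those bullets concrete, here is a minimal sketch of the curve-building step. Everything in it is assumed for illustration: the row fields (`date`, `lab`, `model`, `elo`, `flagship_eligible`) and the suffix-based variant merging are hypothetical stand-ins for whatever the real LM Arena dataset exposes, and the daily fetch itself is elided.

```python
from collections import defaultdict

def canonical(model: str) -> str:
    # Merge inference-mode variants under one name (assumed naming scheme).
    for suffix in ("-thinking", "-high", "-low"):
        model = model.removesuffix(suffix)
    return model

def flagship_curves(rows):
    """One curve per lab: per day, keep the highest-rated flagship-eligible
    model active, and label the point whenever that model changes."""
    best = defaultdict(dict)  # lab -> date -> best row seen that day
    for r in rows:
        if not r["flagship_eligible"]:
            continue
        day = best[r["lab"]]
        if r["date"] not in day or r["elo"] > day[r["date"]]["elo"]:
            day[r["date"]] = r
    curves = {}
    for lab, days in best.items():
        curve, prev = [], None
        for date in sorted(days):
            r = days[date]
            name = canonical(r["model"])
            curve.append({"date": date, "elo": r["elo"],
                          "label": name if name != prev else None})
            prev = name
        curves[lab] = curve
    return curves

# Hypothetical daily snapshot rows, standing in for the fetched dataset.
rows = [
    {"date": "2024-05-01", "lab": "LabA", "model": "a-1", "elo": 1210, "flagship_eligible": True},
    {"date": "2024-05-01", "lab": "LabA", "model": "a-1-mini", "elo": 1150, "flagship_eligible": False},
    {"date": "2024-05-02", "lab": "LabA", "model": "a-2-thinking", "elo": 1235, "flagship_eligible": True},
    {"date": "2024-05-02", "lab": "LabB", "model": "b-1", "elo": 1220, "flagship_eligible": True},
]
print(flagship_curves(rows))
```

Normalizing variant names before comparing labels is what would keep a reasoning-mode toggle from registering as a fake new release.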

Hottest takes

"the decays are just more capable other models entering the population" — underyx
"Elo isn’t an acronym - it’s a person’s name" — tedsanders
"Is this slop?" — refulgentis