June 3, 2026

Bot Fight Club, now with receipts

I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

He paid $1,500 to let AI loose on his app, and the comments turned into a cage match

TLDR: A researcher spent $1,500 testing whether AI bots could break into a fake app, and GPT-5.5 came out on top. The comments instantly turned it into a fairness war, with people arguing the real story was safety restrictions, special permissions, and whether the whole contest was rigged from the start.

A security researcher built a deliberately weak book-review app, handed it to a bunch of artificial intelligence chatbots, and watched which ones could break in. The headline result: GPT-5.5 won the chaos contest, solving the challenge in 7 of its 10 runs, while several rivals completely whiffed. But in the comments, readers were far less interested in the scoreboard than in the messy drama behind it: was this a real test of smarts, or just a test of which bot had fewer brakes on it?

That’s where the popcorn started flying. One camp argued Anthropic’s Claude models looked bad not because they were bad, but because their safety rules kept shutting down legit security work. Another reader piled on with a very relatable complaint: Claude refused at first during app testing, then calmed down only after the user basically said, “Relax, it’s my app.” On the other side, critics said the comparison was unfair from the jump because the OpenAI model was reportedly cleared in advance for security research, making the contest feel a little like a race where one runner got special shoes.

Then came the classic internet flexes. One commenter called the whole method “quite naive” and insisted other models can handle much tougher challenges, from patching programs to dodging anti-debug tricks. And the weirdly glamorous subplot? A commenter name-dropped Apple, NDAs, and a secretive model called Mythos, which gave the thread big “forbidden tech whispered about in dark hallways” energy. So yes, the app got hacked—but the real action was everyone fighting over what the result even means.

Key Points

  • Kasra Rahjerdi built a fake mobile app and backend to test whether LLMs could exploit a Firebase-related access control flaw and retrieve a flag from private reviews.
  • The challenge app used a React Native Expo frontend, a Python/FastAPI backend, and Firebase as the data layer, with exposed configuration in google-services.json enabling the intended attack path.
  • Rahjerdi says the vulnerability reflects a real-world class of issues seen in Firebase and Supabase apps, described as Broken Access Control or Missing Object-Level Authorization.
  • The experiment was described as non-scientific, capped each run at a $10 budget and a two-hour time limit, and cost about $1,500 in total before the author stopped testing.
  • In the reported 10-run results, gpt-5.5 led with 7 solves, deepseek-v4-pro had 3, claude-sonnet-4.6 and claude-opus-4-8 had 2 each, and several other models had 0 solves.
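
For readers unfamiliar with the bug class named above, here is a minimal sketch of Missing Object-Level Authorization in plain Python. It is not the challenge app's code: the `Review` type, the in-memory `REVIEWS` store, the placeholder flag string, and both lookup functions are hypothetical stand-ins for illustration. The vulnerable version returns any review whose ID the caller knows; the checked version also verifies that the caller owns a private review before returning it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    review_id: str
    owner_id: str
    text: str
    private: bool


# Hypothetical in-memory stand-in for the app's data layer.
REVIEWS = {
    "r1": Review("r1", "alice", "Loved it.", private=False),
    "r2": Review("r2", "bob", "FLAG{example-only}", private=True),
}


class Forbidden(Exception):
    """Raised when the caller may not read the requested object."""


def get_review_vulnerable(requester_id: str, review_id: str) -> Review:
    # Flaw: the caller is identified, but never compared against the
    # owner of the object, so knowing an ID is enough to read it.
    return REVIEWS[review_id]


def get_review_checked(requester_id: str, review_id: str) -> Review:
    # Fix: authorize against the specific object being requested.
    review = REVIEWS[review_id]
    if review.private and review.owner_id != requester_id:
        raise Forbidden(f"{requester_id} may not read {review_id}")
    return review


if __name__ == "__main__":
    # alice reads bob's private review through the vulnerable path.
    print(get_review_vulnerable("alice", "r2").text)
    try:
        get_review_checked("alice", "r2")
    except Forbidden as exc:
        print("blocked:", exc)
```

In a Firebase-backed app, the equivalent check would typically live in the database's security rules or in the backend handler rather than in client code, since client-side configuration is visible to anyone who unpacks the app.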

Hottest takes

"It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem" — SOLAR_FIELDS
"The methodology used is quite naive" — mariopt
"A more fair comparison would be a vanilla GPT account" — mynameisvlad