Better Models: Worse Tools

The smartest chatbots are suddenly flubbing basic chores, and the crowd is not calm

TLDR: A developer found newer high-end AI models are more likely than older ones to mess up simple tool instructions, causing tasks to fail for silly reasons. Commenters turned it into a bigger fight about whether cloud AI can ever be trusted, with jokes about browser-war chaos and "smart" systems that still can't do basic chores.

A fresh tech gripe from Armin Ronacher has set off a very online round of "how are the new models worse at this?" outrage. His complaint is simple enough for non-experts: newer versions of Anthropic’s chatbot family can often make the right edit to a file, but they package it the wrong way, adding made-up fields so the tool rejects the request. Translation: the expensive brain knows what to do, then trips over the form it has to fill out. And yes, the comments immediately turned this into a full-blown trust issue.

The strongest reaction was a big, sarcastic "welcome to relying on closed cloud tools." One commenter praised the write-up but confessed they’re now "somewhat worried," while others went much harder, saying that building dependable products on top of unpredictable rented AI is asking for chaos. Another compared today’s model mess to the old browser wars, with every system needing its own special handling. In other words: developers fear they’re rebuilding Internet Explorer headaches, but for chatbots.

Then came the comedy. One developer said they tried a simpler patch format and concluded that all models are terrible at generating line numbers for a proper diff. Another joked that some systems seem haunted by old Codex habits, blurting out weird patch formats like a ghost from AI past. And the iciest hot take of all? An open-source developer being "surprised and concerned" that beloved proprietary software is getting shakier. Ouch.

Key Points

  • Armin Ronacher investigated a Pi issue in which newer Anthropic models sometimes generate edit tool calls containing invented keys inside the `edits[]` array.
  • The article says Claude Opus 4.8 and Sonnet 5 showed this schema-mismatch behavior, while older models in the same family did not in his testing.
  • Ronacher explains that LLM tool calls are produced through in-band text formatting rather than a separate native mechanism.
  • He describes a likely Anthropic-style serialization using ANTML-like markers, with inline simple parameters and JSON-serialized arrays of objects.
  • The article contrasts post-generation validation of JSON with grammar-aware or constrained decoding that blocks invalid schema tokens during sampling.
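To make the last two points concrete, here is a minimal sketch of the two approaches, not Pi's or Anthropic's actual code. The field names `path`, `oldText` and `newText`, the invented `reason` key, and both function names are hypothetical, chosen only for illustration. The first function checks a finished tool call after generation; the second is a toy stand-in for constrained decoding that works on whole key names, whereas a real grammar-aware decoder masks individual tokens during sampling.

```python
import json

# Hypothetical schema: each object in edits[] may contain exactly these keys.
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_edit_call(raw_args: str) -> list[str]:
    """Post-generation validation: parse the finished JSON, then list problems."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    edits = args.get("edits")
    if not isinstance(edits, list):
        return ["'edits' must be an array"]

    errors = []
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        extra = sorted(set(edit) - ALLOWED_EDIT_KEYS)
        missing = sorted(ALLOWED_EDIT_KEYS - set(edit))
        if extra:
            errors.append(f"edits[{i}] has unknown keys: {extra}")
        if missing:
            errors.append(f"edits[{i}] is missing keys: {missing}")
    return errors


def mask_key_candidates(candidates: dict[str, float], used: set[str]) -> dict[str, float]:
    """Toy constrained decoding: drop key candidates the schema forbids
    (or that were already emitted) before anything is sampled."""
    return {
        key: score
        for key, score in candidates.items()
        if key in ALLOWED_EDIT_KEYS and key not in used
    }


if __name__ == "__main__":
    # A call with the right edit but an invented "reason" key.
    bad_call = json.dumps(
        {"path": "app.py", "edits": [{"oldText": "a", "newText": "b", "reason": "tidy"}]}
    )
    print(validate_edit_call(bad_call))
    # With masking, the invented key can never be chosen in the first place.
    print(mask_key_candidates({"oldText": 0.5, "reason": 0.4, "newText": 0.1}, {"oldText"}))
```

The difference in this sketch is where the failure surfaces. Validation only finds the invented key after the whole call has been generated, so the tool has to reject it and the model has to retry. Masking removes the invalid option before it can be emitted.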

Hottest takes

"All models are terrible at generating line numbers for a proper diff, give up on them" — mappu
"It reminds me of the ancient times when browsers all read HTML and CSS differently" — lukasco
"Open source developer surprised and concerned by the trajectory their favorite proprietary software is taking" — ares623