June 29, 2026

Bird model or benchmark cosplay?

Ornith-1.0: self-improving open-source models for agentic coding

A new coding AI drops big score claims, but commenters are asking: is it real or just a remix?

TLDR: Ornith-1.0 says it’s a powerful free coding AI with top open-source scores, but the community immediately turned skeptical. The big debate: is this a real step forward or just an old model in new packaging with flashy test results?

Ornith-1.0 arrived with huge bragging rights: an open-source coding helper in several sizes, free to use, globally available, and claiming eye-popping results on tests where AI models try to fix code and complete software tasks. In plain English, the pitch is simple: this thing is supposed to be a powerful robot coding assistant that can also "improve" how it goes about solving problems. But the real show wasn’t the benchmark charts — it was the comment section, where readers instantly put on their detective hats.

The loudest reaction was basically: "Wait, what exactly is this?" One commenter flat-out asked if Ornith is just a dressed-up version of existing models from Qwen or Gemma, while another dismissed the whole thing as a benchmark-optimized remix. That set off the classic open-source AI argument: is this a genuine breakthrough, or just clever repackaging with prettier numbers? Even the project details got side-eyed, with people noticing a promised 31B version that didn’t seem to show up in the actual weights or score tables.

Still, not everyone came to throw tomatoes. One user said this was the first Qwen fine-tune the local AI crowd hadn’t instantly rejected, and even called it creative for coding help. So the vibe is deliciously split: some see a real contender, others see marketing smoke and mirrors. Everyone seems united on one thing, though: they want receipts, not just charts. In other words, Ornith didn’t just launch a model; it launched a mini comment war.

Key Points

  • The article presents Ornith-1.0 as an MIT-licensed, open-source family of agentic coding models available in multiple sizes, including 9B, 35B, and 397B variants.
  • It says Ornith-1.0 is post-trained on top of Gemma 4 and Qwen 3.5 and uses reinforcement learning to optimize both solution rollouts and the scaffolds that guide them.
  • Benchmark tables report results across Terminal-Bench 2.1, SWE-bench variants, NL2Repo, ClawEval, and SWE Atlas against size-matched open and closed model baselines.
  • The 35B and 397B variants are reported to outperform several Qwen, Gemma, and other large-model baselines on many listed coding benchmarks, while Claude Opus 4.8 leads on several top-end comparisons.
  • The article includes detailed evaluation settings such as harnesses, temperatures, context windows, hardware/time constraints, and notes that the model outputs reasoning and tool-call structures compatible with serving parsers.

Hottest takes

"Is this just a re-skinned qwen?" — kennywinker
"These are simply benchmaxxed versions of either Qwen or Gemma 4." — S0y
"the first Qwen fine-tune that is not immediately rejected by the local LLM community" — ricardobayes