Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

Tiny tool swap makes 15 AI coders smarter — the crowd says the interface is the real boss

TLDR: Swapping a single edit tool in the coding “harness” boosted 14 of 16 AI models, with one leaping from 6.7% to 68.3%. Comments split between harness-first cheerleaders, line-number purists, and concurrency skeptics, plus calls for “VI for LLMs,” proving better plumbing can instantly make AI coders more useful.

Plot twist: the author touched nothing in the brains of 16 coding AIs—only swapped the edit tool in the “harness,” the control panel between the bots and your files—and boom, 14 of 16 models got better. One model, Grok Code Fast, jumped from 6.7% to 68.3% while using 20–30% fewer tokens. The star is Hashline, a new way to tell AIs exactly which lines to edit. Cue the comments: “it’s the harness, not the model” becomes the rallying cry, with one user pointing to a tweet showing a benchmark doubling just by switching harnesses.
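
The post doesn’t spell out the wire format in this digest, but the core idea, pinning every line to a checksum of its own content, fits in a few lines of Python. This is a minimal sketch only: the `hashline_view` function and the `tag:line|` layout are illustrative assumptions, not the author’s actual implementation.

```python
import hashlib

def hashline_view(text: str) -> str:
    """Render a file the way a hash-anchored edit tool might show it:
    each line is prefixed with a short digest of its content, so a model
    can target "line 2 with tag abc123" instead of a bare, shiftable number."""
    rendered = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        tag = hashlib.sha1(line.encode("utf-8")).hexdigest()[:6]
        rendered.append(f"{tag}:{lineno}| {line}")
    return "\n".join(rendered)

print(hashline_view("def add(a, b):\n    return a + b"))
```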

Then the brawls begin. Minimalists push the line-numbers-only approach: fewer tokens, fewer mistakes, keep it simple. Old-school keyboard warriors dream of “VI for LLMs”: teach the bot to navigate text like a pro and give it special shortcuts. Tool nerds bring receipts: a table-of-contents tool to guide tiny models. But skeptics crash the party: “Great work, but concurrency is lost,” warning that once lines shift, you have to feed the whole file back, killing parallel edits. Jokes fly about AIs “not speaking Patch,” and memes crown Hashline the “checksum chaperone.” The vibe: excited tinkerers vs. practical naysayers, all agreeing on one thing: the interface is the real power move.
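
The concurrency complaint is easy to make concrete: a line-number edit computed against one snapshot of a file lands on the wrong line as soon as an earlier edit shifts things. A content hash turns that silent mis-edit into a loud rejection. Here is a minimal sketch of that guard, reusing the illustrative six-character tag from the sketch above; `apply_edit` is a hypothetical helper, not part of any published harness.

```python
import hashlib

def tag(line: str) -> str:
    """Short content digest used to anchor an edit (assumed format)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:6]

def apply_edit(lines: list[str], lineno: int, expected_tag: str, replacement: str) -> None:
    """Replace one line only if it still matches the hash the model saw.
    A stale line number fails loudly instead of silently editing whatever
    happens to live there now: the "checksum chaperone" at work."""
    if tag(lines[lineno - 1]) != expected_tag:
        raise ValueError(f"line {lineno} changed since the model read it; resend the file")
    lines[lineno - 1] = replacement

lines = ["def add(a, b):", "    return a - b"]  # bug: should be a + b
apply_edit(lines, 2, tag("    return a - b"), "    return a + b")
print("\n".join(lines))
```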

Key Points

  • Changing only the edit tool to “Hashline” improved coding accuracy across 16 models, outperforming the Patch format on 14 of them while using 20–30% fewer tokens.
  • Largest reported gain: Grok Code Fast 1 accuracy increased from 6.7% to 68.3% (+61.6 percentage points).
  • The harness, not just the model, is a critical bottleneck: it shapes what the model sees as input, how its output is integrated, and where failures occur.
  • Existing edit methods have drawbacks: Codex’s apply_patch often fails on non-OpenAI models; str_replace (Claude Code/Gemini) is brittle (see the sketch after this list); Cursor trained a 70B model to merge edits.
  • Benchmarks (Aider, JetBrains Diff-XYZ, EDIT-Bench) show edit format can swing performance dramatically and no single format dominates.
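
To make the str_replace brittleness from the fourth bullet concrete, here is a generic sketch of an exact-string edit tool and the two ways it dies; `str_replace_edit` is a stand-in written for illustration, not Claude Code’s or Gemini’s actual tool.

```python
def str_replace_edit(source: str, old: str, new: str) -> str:
    """Exact-string replacement, the style of edit the bullet calls brittle:
    one stray space means zero matches, a common snippet means several,
    and either way the edit has to be rejected."""
    matches = source.count(old)
    if matches == 0:
        raise ValueError("old text not found; often an indentation or whitespace mismatch")
    if matches > 1:
        raise ValueError(f"old text is ambiguous ({matches} matches); more context needed")
    return source.replace(old, new, 1)

code = "if ready:\n    launch()\nif ready:\n    launch()\n"
try:
    str_replace_edit(code, "if ready:\n    launch()", "if armed:\n    launch()")
except ValueError as err:
    print(err)  # rejected: the snippet appears twice
```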

Hottest takes

"The harness matters far more than most people think." — woeirua
"VI for LLMs" — rafaelmn
"Great work, but concurrency is lost." — pcwelder

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.