Taming LLMs: Using Executable Oracles to Prevent Bad Code

Cage the Code Bots: Zero-Freedom Plan Has Devs at Each Other’s Throats

TLDR: The author argues that the fix for sloppy AI code is to cage it with nonstop automated tests and checks. Commenters split: some call it overkill (“just write it yourself”), others want AI-vs-AI reviews, and memes like “tested into existence” and Haskell jokes fly. It matters because this fight shapes how we build software next.

A new post argues the only way to stop AI code generators (so‑called large language models, or LLMs) from spitting out weird, broken code is to lock them in a box of nonstop automated checks, aka “executable oracles.” Think of it like a treadmill of automatic checkups: if the bot tries something dumb, the checks slap it down. The author says such oracles could have caught the bugs in an AI-made C compiler, and that they did help another AI write smarter analysis code once it was squeezed between checks for soundness and precision.
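To make “executable oracle” concrete, here is a minimal Python sketch of one flavor, differential testing: run the generated code and a trusted reference on the same random inputs and flag any disagreement. Random-testing tools such as Csmith apply the same idea to whole compilers. Everything named here (`reference_sat_add`, `candidate_sat_add`, `differential_oracle`) is hypothetical and chosen for illustration; it is not the setup used in the post.

```python
import random
from typing import Callable, Optional, Tuple

def reference_sat_add(a: int, b: int) -> int:
    """Trusted implementation: 8-bit saturating add."""
    return min(a + b, 255)

def candidate_sat_add(a: int, b: int) -> int:
    """Stand-in for generated code. It wraps around instead of
    saturating, which is a deliberate bug for this example."""
    return (a + b) & 0xFF

def differential_oracle(
    candidate: Callable[[int, int], int],
    reference: Callable[[int, int], int],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return the first random input on which the two functions
    disagree, or None if no disagreement shows up in `trials` tries."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randrange(256), rng.randrange(256)
        if candidate(a, b) != reference(a, b):
            return (a, b)
    return None
```

A coding agent would be asked to keep revising until `differential_oracle(candidate, reference)` returns `None`. That is weaker than a proof, but it removes the freedom to ship code that fails on inputs nobody thought to write a test case for.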

But the comments stole the show. One line, “zero degrees of freedom,” lit the thread on fire. Skeptic-in-chief dktoao asked whether that just amounts to inventing a new, complicated language and wondered if all this overhead means we should write the code ourselves. Meanwhile, RS-232 pitched AI vs AI: one bot builds, another bot claws it apart in an adversarial review. Folks started chanting “Code Thunderdome” in spirit. Then the comedy: CrazyStat accused the post of sounding like a bot, while ReptileMan waved a Haskell flag (“now’s its time to shine”) because of course it is. And JSR_FDED turned a phrase into a meme: “tested into existence.”

So yes, the tech is serious—but the vibe? Half lab coat, half roast. Is caging our code bots genius guardrails or just bureaucracy in a trench coat? The crowd is very much not done yelling about it.

Key Points

  • LLM coding agents perform well on constrained tasks but can emit incorrect or low-quality code when given freedom.
  • The article advocates using executable oracles to constrain LLMs, arguing test cases alone are often insufficient.
  • Claude’s C Compiler passed GCC torture tests yet had 34 miscompilation bugs; tools like Csmith and YARPGen could have caught them.
  • The article proposes adding a code-quality oracle (e.g., comparing instruction counts against a gcc -O0 baseline) to steer the model toward better optimization.
  • With oracles for precision and soundness, Codex synthesized superior dataflow transfer functions; code size remained a controllable freedom.
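The last point, pairing a soundness oracle with a precision oracle, can be sketched in Python for a toy case. The example below assumes a 3-bit interval domain and a transfer function for bitwise AND; the domain, the function names, and the scoring rule are all illustrative assumptions, not the benchmark from the article. Soundness is checked by brute force over every concrete input, and precision is scored as excess width over the tightest possible interval.

```python
from itertools import product
from typing import Callable, Tuple

Interval = Tuple[int, int]  # inclusive (lo, hi)

WIDTH = 3  # small enough to enumerate everything
VALUES = range(1 << WIDTH)
INTERVALS = [(lo, hi) for lo in VALUES for hi in VALUES if lo <= hi]

def candidate_and(a: Interval, b: Interval) -> Interval:
    """Stand-in for a synthesized transfer function for x & y."""
    return (0, min(a[1], b[1]))

def best_and(a: Interval, b: Interval) -> Interval:
    """Tightest interval, found by enumeration (precision baseline)."""
    results = [x & y
               for x in range(a[0], a[1] + 1)
               for y in range(b[0], b[1] + 1)]
    return (min(results), max(results))

def is_sound(transfer: Callable[[Interval, Interval], Interval]) -> bool:
    """Soundness oracle: every concrete result must lie in the output."""
    for a, b in product(INTERVALS, repeat=2):
        lo, hi = transfer(a, b)
        for x in range(a[0], a[1] + 1):
            for y in range(b[0], b[1] + 1):
                if not lo <= (x & y) <= hi:
                    return False
    return True

def imprecision(transfer: Callable[[Interval, Interval], Interval]) -> int:
    """Precision oracle: total excess width over the tightest
    interval. Zero means optimal for this domain."""
    total = 0
    for a, b in product(INTERVALS, repeat=2):
        lo, hi = transfer(a, b)
        best_lo, best_hi = best_and(a, b)
        total += (hi - lo) - (best_hi - best_lo)
    return total
```

Here `is_sound` rejects anything that drops a possible value, and `imprecision` penalizes anything looser than necessary, so between them there is little room left to be wrong. Code size is not measured by either check, which is why it stays a separate knob.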

Hottest takes

"Wouldn't that just be called inventing a new language" — dktoao
"2 agents, with one as the creator and one as an adversarial 'reviewer'?" — RS-232
"Now is Haskell's time to shine." — ReptileMan