ProgramBench: Can Language Models Rebuild Programs from Scratch?

AI was asked to rebuild software from scratch — and the comments got messier than the code

TLDR: Researchers tested whether AI could recreate full software projects from only the finished program and its documentation, and none of the models truly pulled it off. The comments instantly split into camps: blame the test, praise “agent swarms,” or argue over whether giant one-file code is secretly the future.

The big reveal from ProgramBench: today’s code-writing AI still can’t rebuild real software from scratch. Researchers threw 200 challenges at nine language models, ranging from tiny command-line tools to huge everyday programs like SQLite, FFmpeg, and even the PHP interpreter. The result? Not one model fully solved a single task, and the best performer only got close on a tiny sliver of them. The most eyebrow-raising detail: the bots kept spitting out giant one-file Franken-programs instead of the neat, human-style projects people expected.

And honestly, the comments were ready for blood. One person raced in with the preemptive eye-roll: “In before ‘but they did not use my agent swarm’” — basically calling out the inevitable crowd who think one more layer of AI managers would magically fix everything. But that exact debate exploded anyway. Some insisted the test was unfair because the researchers didn’t use teams of helper bots, planners, and reviewers. Others were way more interested in the accidental confession that maybe AI actually likes messy mega-files. One commenter casually admitted their own company code is already huge and monolithic, while another fired back with a strict 650-line limit like a digital parent setting curfew.
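
Side note on that 650-line curfew: a cap like that is trivial to enforce mechanically. Below is a minimal sketch of such a lint in Python; the 650 number comes from the comment, while the file glob and exit behavior are assumptions rather than any real tool's configuration.

```python
# Minimal sketch of a LOC-cap lint like the one described in the comments.
# The 650 cap is the commenter's number; the glob and exit behavior are
# assumptions, not any specific linter's configuration.
import sys
from pathlib import Path

MAX_LOC = 650

def oversized(root: str = ".", pattern: str = "**/*.py") -> list[tuple[Path, int]]:
    """Return (path, line_count) for every source file over the cap."""
    hits = []
    for path in Path(root).glob(pattern):
        with path.open(encoding="utf-8", errors="ignore") as f:
            loc = sum(1 for _ in f)
        if loc > MAX_LOC:
            hits.append((path, loc))
    return hits

if __name__ == "__main__":
    hits = oversized()
    for path, loc in hits:
        print(f"{path}: {loc} lines (cap is {MAX_LOC})")
    sys.exit(1 if hits else 0)  # nonzero exit fails CI when any file is over
```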

Then came the sci-fi chaos: one poster imagined a future where AI skips coding altogether and just hands you a finished binary for your computer chip. Translation: fewer developers, more prompt-whisperers. So while the paper says AI isn’t replacing software engineers yet, the crowd turned it into a full-on food fight over whether the future is agent swarms, single-file spaghetti, or just pressing a button and praying.

Key Points

  • ProgramBench is a benchmark created to test whether software engineering agents can reconstruct entire software projects rather than solve isolated coding tasks.
  • In the benchmark, agents receive only a program and its documentation and must build a codebase that matches the reference executable's behavior.
  • Evaluation is based on end-to-end behavioral tests generated through agent-driven fuzzing, without requiring a prescribed implementation structure (a rough harness sketch follows this list).
  • The benchmark includes 200 tasks ranging from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter.
  • Across 9 evaluated language models, none fully solved any task; the best model passed 95% of tests on only 3% of tasks, and models often produced monolithic single-file code.
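
The paper's actual harness isn't reproduced here, but the idea behind the fuzzing bullet above is essentially differential testing: feed the same generated inputs to the reference executable and the rebuilt one, and count how often they agree. Here is a minimal sketch in Python, assuming exit code and stdout are the observed behavior; every name, path, and parameter below is illustrative, not from the paper.

```python
# Minimal sketch of fuzzing-based differential testing, in the spirit of
# the bullet above. run_case, behavioral_score, the paths, and the input
# generator are all illustrative assumptions, not the paper's harness.
import random
import string
import subprocess

TIMEOUT_S = 10  # treat hangs as their own distinct outcome

def random_args(rng: random.Random, max_args: int = 3) -> list[str]:
    """Build a small random argv from printable non-whitespace characters."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return ["".join(rng.choices(alphabet, k=rng.randint(1, 8)))
            for _ in range(rng.randint(0, max_args))]

def run_case(exe: str, args: list[str], stdin_data: bytes):
    """Run one executable on one fuzzed input; observe exit code and stdout."""
    try:
        proc = subprocess.run([exe, *args], input=stdin_data,
                              capture_output=True, timeout=TIMEOUT_S)
        return proc.returncode, proc.stdout  # stderr ignored for simplicity
    except subprocess.TimeoutExpired:
        return None

def behavioral_score(reference: str, candidate: str,
                     trials: int = 1000, seed: int = 0) -> float:
    """Fraction of fuzzed inputs on which the rebuilt binary matches the reference."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        args = random_args(rng)
        stdin_data = rng.randbytes(rng.randint(0, 64))
        if run_case(reference, args, stdin_data) == run_case(candidate, args, stdin_data):
            passed += 1
    return passed / trials

if __name__ == "__main__":
    # e.g. compare the system `wc` against a hypothetical model-rebuilt one
    print(behavioral_score("/usr/bin/wc", "./rebuilt_wc", trials=200))
```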

Hottest takes

"In before 'but they did not use my agent swarm'" — vatsachak
"all of our code is monolithic with some files close 20K lines of code" — _pdp_
"We have a lint that caps source code files at 650 LOC" — luca-ctx
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.