May 8, 2026
Spec, Lies, and AI Videotape
Can LLMs model real-world systems in TLA+?
AI looked smart until readers asked: did it understand, or just copy the homework?
TL;DR: Researchers found that AI can write formal-looking descriptions of software that compile and run but often fail to match what the real program actually does. In the comments, readers split between "AI is improving fast," "use tools tied to real code instead," and "if the bot does the thinking, what's the point?"
A fresh research post asked a deceptively simple question: can today’s AI chatbots actually describe how real software behaves, or are they just serving polished book-report energy? The team behind SysMoBench built a test that feeds models real system code and checks whether the formal write-up matches what the software actually does. And the big reveal is deliciously awkward: many models look great at first, producing neat, runnable specs, but then stumble when it’s time to prove they captured the real implementation instead of a famous textbook version.
The comment section immediately turned into a mini food fight. One camp basically yelled, “This is why you should tie the proof directly to the code,” with one reader calling the post an “accidental advertisement” for another verification approach. Another crowd was more impressed than scandalized, pointing out that Claude has gotten weirdly good fast — so good that someone casually had it model Monopoly “for laughs,” which is exactly the kind of chaotic benchmark the internet deserves. And then came the philosophical side-eye: if the whole point of writing a formal spec is to force humans to think clearly, what exactly are we doing if a bot writes the whole thing for us?
So yes, the paper is about software testing. But the comments make it feel like a reality show reunion: Team AI Assistant, Team Do It By Hand, and Team Are We Missing The Entire Point? all showed up ready to argue.
Key Points
- The article investigates whether LLMs can produce TLA+ models that reflect real software implementations rather than reproducing known protocol specifications.
- An example involving Claude and etcd's Raft implementation showed that a syntactically valid, runnable spec could still fail to capture implementation-specific behavior.
- SysMoBench is introduced as an automated benchmark covering eleven systems spanning concurrent synchronization and distributed protocols.
- SysMoBench evaluates generated specifications in four phases: syntax, runtime, conformance via trace validation, and invariant checking for safety and liveness.
- The article reports that leading LLMs generally do well on the syntax and runtime phases but show recurring problems during conformance evaluation.
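To make the four-phase structure concrete, here is a minimal sketch of such a staged pipeline. This is illustrative only: the class, function names, and flags are hypothetical and are not SysMoBench's actual API; real phases would invoke the TLA+ tools (parser, TLC) and trace-validation machinery rather than boolean flags. The key idea it captures is that the phases form a gauntlet, where a spec that fails an earlier phase never reaches the later ones.

```python
from dataclasses import dataclass

@dataclass
class Spec:
    """Toy stand-in for a generated TLA+ spec (hypothetical).

    Each flag abstracts one evaluation phase's outcome; in a real
    pipeline these would come from running the TLA+ toolchain.
    """
    parses: bool = True            # phase 1: syntax (would the spec parse?)
    runs: bool = True              # phase 2: runtime (does model checking start cleanly?)
    matches_traces: bool = True    # phase 3: conformance (do real execution traces validate?)
    holds_invariants: bool = True  # phase 4: safety/liveness invariants hold

# Phases in order; each is a (name, check) pair.
PHASES = [
    ("syntax", lambda s: s.parses),
    ("runtime", lambda s: s.runs),
    ("conformance", lambda s: s.matches_traces),
    ("invariants", lambda s: s.holds_invariants),
]

def evaluate(spec: Spec) -> list[str]:
    """Run phases in order, stopping at the first failure.

    Later phases are only meaningful if earlier ones pass: there is no
    point trace-validating a spec that does not even parse.
    """
    passed = []
    for name, check in PHASES:
        if not check(spec):
            break
        passed.append(name)
    return passed
```

Under this toy model, the article's headline finding corresponds to specs like `Spec(matches_traces=False)`: `evaluate` returns `["syntax", "runtime"]`, i.e. the spec looks fine until trace validation asks whether it describes the real implementation.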