Program-of-Thought Prompting Outperforms Chain-of-Thought by 15%

AI writes code to solve math; commenters shout 'old news' and 'unsafe'

TLDR: PoT makes AI write code that an external computer runs, beating step-by-step thinking by about 12–15%. Commenters say it’s old news already in tool-enabled systems, while others warn that executing AI-written code is risky and needs sandboxes—accuracy rises, but safety steals the spotlight.

Move over handwritten “thinky” steps — the new flex is making the AI write tiny programs. The paper behind the buzz, arXiv:2211.12588, says Program‑of‑Thoughts (PoT) beats Chain‑of‑Thought by roughly 12% by letting an external computer execute the code the model writes. The headline touts 15%, and the comments immediately turn it into a math fight.

One camp yells “old news!”: jey says modern chatbots already do this when code execution is on, pointing to Anthropic’s Advanced Tool Use, while mgraczyk drops “Programmatic Tool Calling.” axiom92 waves receipts with the earlier PAL approach, and jhart99 reminds everyone this paper launched in 2022, not today. The vibe: “PoT is great, but stop acting like it just landed.”

Then comes the spice: eric‑burel calls it “self‑destructive prompting” and warns that “running generated code is usually unsafe.” Cue debates about sandboxes and whether agent frameworks actually keep the knives in the drawer. The meme brigade arrives with “This is just Excel with vibes,” “CoT is journaling; PoT is sending it to the calculator,” and “Skynet, but it can’t leave the sandbox.” Fans love the accuracy boost; skeptics ask if the safety tax kills the hype. Either way, the crowd agrees: give the bot a calculator, but watch its hands.
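The sandbox worry is concrete enough to sketch. A minimal guardrail (not from the paper; purely an illustration) is to run model-generated code in a separate interpreter process with a timeout. The helper `run_generated_code` below is hypothetical, and real deployments layer containers or VMs on top of this:

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: float = 5.0) -> str:
    """Execute untrusted, model-generated Python in a child process.

    A subprocess in isolated mode (-I) with a timeout is a minimal
    guardrail, NOT a real sandbox: it limits environment leakage and
    runaway loops but not filesystem or network access.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-I", path],   # -I: ignore env vars and user site dir
        capture_output=True,
        text=True,
        timeout=timeout_s,              # kill long-running generated code
    )
    return result.stdout.strip()
```

A `subprocess.TimeoutExpired` exception here is the happy path for a misbehaving program; agent frameworks typically catch it and report failure rather than hang.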

Key Points

  • The paper introduces Program of Thoughts (PoT) prompting to separate reasoning from computation for numerical reasoning tasks.
  • PoT uses language models (mainly Codex) to generate executable programs; computation is performed by an external executor.
  • Evaluations span five math word problem datasets and three financial QA datasets under few-shot and zero-shot setups.
  • PoT achieves an average ~12% performance gain over Chain-of-Thought; with self-consistency it reaches SoTA on math and near-SoTA on financial datasets.
  • Data and code are publicly released on GitHub; the work is published in TMLR 2023 and hosted on arXiv.
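The separation the paper describes fits in a few lines. In this sketch, `call_llm` is a hypothetical stand-in for the code-generating model (the paper mainly uses Codex, prompted to leave the answer in a variable such as `ans`), and a plain `exec` stands in for the external executor:

```python
# Program-of-Thoughts in miniature: the model writes a program,
# an external interpreter computes the answer.

POT_PROMPT = """# Q: A store sells pens at $3 each. Tom buys 4 pens
# and pays with a $20 bill. How much change does he get?
# Write Python that stores the final answer in `ans`.
"""

def call_llm(prompt: str) -> str:
    # Hypothetical model call; a real PoT setup queries an LLM here.
    # Canned response shown for illustration only.
    return (
        "pen_price = 3\n"
        "n_pens = 4\n"
        "paid = 20\n"
        "ans = paid - pen_price * n_pens\n"
    )

def solve_with_pot(question_prompt: str) -> int:
    program = call_llm(question_prompt)  # model does the reasoning as code
    scope: dict = {}
    exec(program, scope)                 # executor does the computation
    return scope["ans"]                  # read back the computed answer

print(solve_with_pot(POT_PROMPT))
```

The point of the split is that the model never performs arithmetic itself, which is where chain-of-thought transcripts tend to slip; the interpreter is exact by construction.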

Hottest takes

"Underlying paper is from 2022 and should be indicated in the title" — jhart99
"This seems to be incorporated into current LLM generations already" — jey
"running generated code is usually unsafe" — eric-burel
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.