Tree Search Distillation for Language Models Using PPO

Tiny AI learns with a “search party” — commenters cry compute flex

TLDR: A small AI trained with a “search” trick improved at a number game, hitting 11.3% without extra helpers. Commenters are split: some cheer the AlphaZero vibes, others demand fair compute comparisons and real-world tests, asking if it’s just spending more energy to look smart.

An AI researcher ran a classic game trick on a tiny model: teach it to “think ahead” by searching through possible steps, then bake that skill into its brain. The result? On the number-crunching game Countdown, the upgraded 1.5B model now scores 11.3% without extra helpers — beating other training methods and the original model. But the community didn’t just clap; they brought popcorn.

The top hot take: “Isn’t this just more compute?” One reader, supermdguy, flagged the confusion: if the search happens during training and you distill it into the model, doesn’t inference stay the same? Translation: stop calling it faster just because you trained longer. Another voice, natufunu, wants receipts: how does pure search at test-time stack up against simple “best-of-N” with the same budget? Fans say this is AlphaZero-style thinking for words; skeptics say it’s “compute cosplay.”
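For readers unfamiliar with the baseline natufunu mentions, "best-of-N" just means sampling N completions and keeping the highest-scoring one. A minimal sketch, with hypothetical `generate` and `score` stand-ins for a real model and verifier (names assumed, not from the post):

```python
import random

def best_of_n(generate, score, n: int, seed: int = 0):
    """Draw n samples from `generate` and return the one `score` ranks highest.

    `generate` and `score` are placeholders: in the setting discussed here,
    `generate` would sample a completion from the language model and `score`
    would be a task verifier (e.g. a 0/1 solution check).
    """
    rng = random.Random(seed)
    samples = [generate(rng) for _ in range(n)]
    return max(samples, key=score)
```

The fairness question in the thread is exactly about this: best-of-N spends its compute at test time, while search-then-distill spends it at training time, so an apples-to-apples comparison has to fix the total budget.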

Memes flew: “11.3% is still a D+,” “AI speed-running math with a walkthrough,” and “pUCT vs UCT is the nerdiest beef of 2026.” The author promises bigger models and more posts, while the comments demand fair, apples-to-apples comparisons and real-world tasks beyond puzzles. It’s equal parts brain upgrade and bench wars, and everyone’s yelling “show the receipts.”

Key Points

  • The study applies MCTS over reasoning steps to Qwen-2.5-1.5B-Instruct and distills trajectories via an online PPO loop.
  • On the Countdown task, the distilled model (without a search harness) achieves mean@16 of 11.3%, outperforming CISPO (8.4%) and best-of-N (7.7%).
  • Compared to the pre-RL instruct model (3.1%), the approach yields an 8.2 percentage point absolute improvement.
  • GSM8K showed minimal differences between GRPO and MCTS, motivating the shift to the combinatorial Countdown environment (20k train, 820 test; four integers 1–13).
  • A dense reward was used for training due to instability with sparse 0/1 rewards; evaluation retained sparse rewards for interpretability. The MCTS implementation used Tree-of-Thoughts-style step nodes and parallel search with virtual losses.
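To make the setup above concrete, here is a hypothetical sketch of the Countdown environment's sparse 0/1 reward and the mean@16 metric: the model gets four integers (1–13) and a target, and succeeds only if its expression uses exactly those numbers and evaluates to the target. Function names and details are illustrative assumptions, not the post's actual implementation:

```python
import ast
import operator

# Allowed binary operators for Countdown expressions (assumed rules).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate an arithmetic expression restricted to +, -, *, / and numbers."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("disallowed syntax")
    return walk(ast.parse(expr, mode="eval"))

def sparse_reward(expr: str, numbers: list[int], target: int) -> int:
    """1 if expr uses exactly `numbers` and hits `target`, else 0 (sparse reward)."""
    try:
        used = sorted(node.value
                      for node in ast.walk(ast.parse(expr, mode="eval"))
                      if isinstance(node, ast.Constant))
        if used != sorted(numbers):
            return 0
        return int(abs(safe_eval(expr) - target) < 1e-9)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0

def mean_at_k(samples: list[str], numbers: list[int], target: int) -> float:
    """mean@k: fraction of k sampled completions that solve the puzzle."""
    return sum(sparse_reward(s, numbers, target) for s in samples) / len(samples)
```

Under this reading, the headline mean@16 of 11.3% means roughly 1.8 of 16 sampled completions solve a puzzle on average; the dense training reward mentioned above would replace the 0/1 check with partial credit, but its exact shape isn't specified in the summary.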

Hottest takes

“So wouldn’t the model still have the same inference cost?” — supermdguy
“Did you compare MCTS against best-of-N with the same compute budget?” — natufunu
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.