March 14, 2026
Search it till you win
Tree Search Distillation for Language Models Using PPO
Tiny AI learns with a “search party” — commenters cry compute flex
TLDR: A small AI trained with a “search” trick improved at a number game, hitting 11.3% without extra helpers. Commenters are split: some cheer the AlphaZero vibes, others demand fair compute comparisons and real-world tests, asking if it’s just spending more energy to look smart.
An AI researcher ran a classic game trick on a tiny model: teach it to “think ahead” by searching through possible steps, then bake that skill into its brain. The result? On the number-crunching game Countdown, the upgraded 1.5B model now scores 11.3% without extra helpers — beating other training methods and the original model. But the community didn’t just clap; they brought popcorn.
The top hot take: “Isn’t this just more compute?” One reader, supermdguy, flagged the confusion: if the search happens during training and gets distilled into the model, inference cost stays the same — so the real bill is the extra training compute, not a free lunch. Another voice, natufunu, wants receipts: how does pure search at test time stack up against simple “best-of-N” sampling on the same budget? Fans say this is AlphaZero-style thinking for words; skeptics say it’s “compute cosplay.”
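For readers unfamiliar with the baseline commenters keep invoking, “best-of-N” is about as simple as test-time compute gets: sample N answers and keep the best one under some scorer. A minimal sketch — `generate` and `score` here are hypothetical stand-ins for a real LM call and a task verifier, not the post’s code:

```python
import random

def generate(prompt: str, rng: random.Random) -> str:
    # Stand-in for a stochastic LM sample.
    return f"candidate-{rng.randint(0, 99)}"

def score(answer: str) -> float:
    # Stand-in for a task reward (e.g. 1.0 if a Countdown expression checks out).
    return float(answer.endswith("7"))

def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    # Draw n candidates and return the highest-scoring one.
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

The fair-comparison demand is then: give this loop the same token budget the search method spent, and see who wins.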
Memes flew: “11.3% is still a D+,” “AI speed-running math with a walkthrough,” and “pUCT vs UCT is the nerdiest beef of 2026.” The author promises bigger models and more posts, while the comments demand fair, apples-to-apples comparisons and real-world tasks beyond puzzles. It’s equal parts brain upgrade and bench wars, and everyone’s yelling “show the receipts.”
Key Points
- The study applies MCTS over reasoning steps to Qwen-2.5-1.5B-Instruct and distills the search trajectories via an online PPO loop.
- On the Countdown task, the distilled model (without a search harness) achieves mean@16 of 11.3%, outperforming CISPO (8.4%) and best-of-N (7.7%).
- Compared to the pre-RL instruct model (3.1%), the approach yields an 8.2-percentage-point absolute improvement.
- GSM8K showed minimal differences between GRPO and MCTS, motivating the shift to the combinatorial Countdown environment (20k train, 820 test; four integers from 1–13).
- A dense reward was used for training due to instability with sparse 0/1 rewards; evaluation retained sparse rewards for interpretability.
- The MCTS implementation used Tree-of-Thoughts-style step nodes and parallel search with virtual losses.
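The pUCT-vs-UCT beef and the virtual-loss trick from the bullets above can be sketched roughly like this, assuming an AlphaZero-style node layout; `Node`, `c_puct`, and the exact bonus formula are illustrative, not the post’s actual implementation:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                 # policy prior P(s, a), e.g. from the LM's step proposals
    visits: int = 0
    value_sum: float = 0.0
    virtual_loss: int = 0        # rollouts currently in flight through this node
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        # Virtual losses count as losses (value 0) until their rollout returns,
        # which steers parallel workers toward different branches.
        n = self.visits + self.virtual_loss
        return self.value_sum / n if n else 0.0

def puct_select(parent: Node, c_puct: float = 1.5) -> str:
    # pUCT (AlphaZero-style): exploit Q plus a prior-weighted exploration bonus.
    # Plain UCT would use a sqrt(log N / n) term with no prior.
    total = sum(ch.visits + ch.virtual_loss for ch in parent.children.values())
    def puct(ch: Node) -> float:
        u = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits + ch.virtual_loss)
        return ch.q() + u
    action = max(parent.children, key=lambda a: puct(parent.children[a]))
    parent.children[action].virtual_loss += 1   # claim the path before expanding it
    return action
```

The design point: pUCT weights exploration by the policy prior, so the model’s own step proposals guide the tree, while virtual losses temporarily mark in-flight paths as losing so parallel searchers fan out instead of piling onto one branch.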