Tree Search Distillation for Language Models Using PPO

Tiny AI learns with a “search party” — commenters cry compute flex

TLDR: A small AI trained with a “search” trick improved at a number game, hitting 11.3% without extra helpers. Commenters are split: some cheer the AlphaZero vibes, others demand fair compute comparisons and real-world tests, asking if it’s just spending more energy to look smart.

An AI researcher ran a classic game trick on a tiny model: teach it to “think ahead” by searching through possible steps, then bake that skill into its brain. The result? On the number-crunching game Countdown, the upgraded 1.5B model now scores 11.3% without extra helpers — beating other training methods and the original model. But the community didn’t just clap; they brought popcorn.

The top hot take: “Isn’t this just more compute?” One reader, supermdguy, flagged the confusion: if the search happens during training and you distill it into the model, doesn’t inference stay the same? Translation: stop calling it faster just because you trained longer. Another voice, natufunu, wants receipts: how does pure search at test-time stack up against simple “best-of-N” with the same budget? Fans say this is AlphaZero-style thinking for words; skeptics say it’s “compute cosplay.”
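For readers unfamiliar with the baseline natufunu mentions, "best-of-N" just means sampling N completions and keeping the highest-scoring one. A minimal sketch, with hypothetical `generate` and `score` stand-ins for a real model and verifier (names assumed, not from the post):

```python
import random

def best_of_n(generate, score, n: int, seed: int = 0):
    """Draw n samples from `generate` and return the one `score` ranks highest.

    `generate` and `score` are placeholders: in the setting discussed here,
    `generate` would sample a completion from the language model and `score`
    would be a task verifier (e.g. a 0/1 solution check).
    """
    rng = random.Random(seed)
    samples = [generate(rng) for _ in range(n)]
    return max(samples, key=score)
```

The fairness question in the thread is exactly about this: best-of-N spends its compute at test time, while search-then-distill spends it at training time, so an apples-to-apples comparison has to fix the total budget.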

Memes flew: “11.3% is still a D+,” “AI speed-running math with a walkthrough,” and “pUCT vs UCT is the nerdiest beef of 2026.” The author promises bigger models and more posts, while the comments demand fair, apples-to-apples comparisons and real-world tasks beyond puzzles. It’s equal parts brain upgrade and bench wars, and everyone’s yelling “show the receipts.”

Key Points

  • The study applies MCTS over reasoning steps to Qwen-2.5-1.5B-Instruct and distills trajectories via an online PPO loop.
  • On the Countdown task, the distilled model (without a search harness) achieves mean@16 of 11.3%, outperforming CISPO (8.4%) and best-of-N (7.7%).
  • Compared to the pre-RL instruct model (3.1%), the approach yields an 8.2 percentage point absolute improvement.
  • GSM8K showed minimal differences between GRPO and MCTS, motivating the shift to the combinatorial Countdown environment (20k train, 820 test; four integers 1–13).
  • A dense reward was used for training due to instability with sparse 0/1 rewards; evaluation retained sparse rewards for interpretability. The MCTS implementation used Tree-of-Thoughts-style step nodes and parallel search with virtual losses.
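To make the setup above concrete, here is a hypothetical sketch of the Countdown environment's sparse 0/1 reward and the mean@16 metric: the model gets four integers (1–13) and a target, and succeeds only if its expression uses exactly those numbers and evaluates to the target. Function names and details are illustrative assumptions, not the post's actual implementation:

```python
import ast
import operator

# Allowed binary operators for Countdown expressions (assumed rules).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate an arithmetic expression restricted to +, -, *, / and numbers."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("disallowed syntax")
    return walk(ast.parse(expr, mode="eval"))

def sparse_reward(expr: str, numbers: list[int], target: int) -> int:
    """1 if expr uses exactly `numbers` and hits `target`, else 0 (sparse reward)."""
    try:
        used = sorted(node.value
                      for node in ast.walk(ast.parse(expr, mode="eval"))
                      if isinstance(node, ast.Constant))
        if used != sorted(numbers):
            return 0
        return int(abs(safe_eval(expr) - target) < 1e-9)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0

def mean_at_k(samples: list[str], numbers: list[int], target: int) -> float:
    """mean@k: fraction of k sampled completions that solve the puzzle."""
    return sum(sparse_reward(s, numbers, target) for s in samples) / len(samples)
```

Under this reading, the headline mean@16 of 11.3% means roughly 1.8 of 16 sampled completions solve a puzzle on average; the dense training reward mentioned above would replace the 0/1 check with partial credit, but its exact shape isn't specified in the summary.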

Hottest takes

“So wouldn’t the model still have the same inference cost?” — supermdguy
“Did you compare MCTS against best-of-N with the same compute budget?” — natufunu
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.