RL is more information-inefficient than you thought

Is RL just guessing while pretraining gets all the answers? Commenters clash

TLDR: The article says RL gives sparse yes/no feedback, making it less efficient than token-by-token pretraining. Commenters clap back, debating definitions, arguing RL can mimic supervised updates and penalize mistakes, and asking what “inefficient” even means, because this fight decides how we spend massive compute budgets.

The post claims reinforcement learning (RL) is a stingy teacher: you spend tons of compute on a long “thinking” rollout and get back a single yes/no reward, while classic pretraining (next‑word prediction) hands you rich, token‑by‑token feedback. In plain English, supervised learning feels like a firehose of hints, RL like an eyedropper. The author argues that per sample you learn fewer bits in RL unless your pass rate is near a coin flip, and that power-law scaling means each order-of-magnitude improvement in pass rate costs roughly the same compute. Cue the crowd: chaos, memes, and math fights.
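
To put a number on the eyedropper, here's a minimal sketch (illustrative, not taken from the article) of how many bits a single pass/fail reward can carry; note it peaks at exactly the coin-flip pass rate the author flags:

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Bits of information in one pass/fail reward when the pass rate is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# The yes/no reward is most informative at a coin-flip pass rate:
for p in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(f"pass rate {p:>6}: {bernoulli_entropy(p):.4f} bits per rollout")
# pass rate  0.001: 0.0114 bits per rollout
# pass rate    0.1: 0.4690 bits per rollout
# pass rate    0.5: 1.0000 bits per rollout  <- the coin-flip sweet spot
# pass rate    0.9: 0.4690 bits per rollout
# pass rate  0.999: 0.0114 bits per rollout
```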

Commenters swarmed in. First, a clarification drop: “RL means Reinforcement Learning, folks,” noted one. Then came the spice. scaredginger nitpicked the framing (pretraining isn’t really “supervised”), while bogtog blasted the premise that RL is just a single‑bit verdict, pointing to nuance beyond yes/no and linking Toby Ord’s post. macleginn argued that in the “happy” case (positive reward), policy gradients update almost like supervised learning, and that in the “unhappy” case RL can penalize specific bad choices, something plain supervised learning can’t do (a rough sketch of that argument follows below). derbOac asked the existential question: “inefficient about what?” Meanwhile the thread birthed memes: coin‑flip RL, “firehose vs eyedropper,” and jokes about the model guessing “halcyon” for “The sky is…”. The vibe: pretrain to boost base smarts, then use RL for goals, while everyone argues over the compute bill and the definition of “efficient.”
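
For the mathematically inclined, macleginn’s point can be sketched with the vanilla REINFORCE estimator and a ±1 reward; this is an assumed setup, since the thread doesn’t pin down the exact estimator or any baseline:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[\, R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]
```

With R(y) = +1 on a successful rollout, this is exactly the maximum-likelihood (supervised) gradient on that rollout; with R(y) = −1 it actively pushes probability away from the sampled tokens, which plain next-token supervision has no mechanism to do.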

Key Points

  • RL requires a long, many-token trajectory to obtain a single reward, while supervised learning provides a signal at every token.
  • Efficiency is framed as Bits/FLOP = Samples/FLOP × Bits/Sample, with RL having lower Bits/Sample during most of training.
  • Supervised learning yields −log₂(p) bits per sample; RL’s binary feedback yields at most the Bernoulli entropy, H(p) = −p·log₂(p) − (1−p)·log₂(1−p).
  • With a random model, the pass rate is about 1/vocabulary size, making RL signals sparse and learning inefficient early on (see the sketch after this list).
  • Power-law scaling suggests each order-of-magnitude improvement in pass rate costs roughly the same compute, so RL’s apparent early advantages can mislead.
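
A back-of-the-envelope version of those points, assuming an illustrative 50,000-token vocabulary (the article doesn’t pin one down):

```python
import math

VOCAB_SIZE = 50_000  # illustrative assumption, not from the article

# Supervised: each token carries -log2(p) bits. A freshly initialized model
# predicting uniformly at random pays log2(VOCAB_SIZE) bits per token.
supervised_bits_per_token = math.log2(VOCAB_SIZE)

# RL: a whole rollout yields at most the Bernoulli entropy of the pass rate.
# Early on, the pass rate on a one-token answer is roughly 1/VOCAB_SIZE.
p = 1 / VOCAB_SIZE
rl_bits_per_rollout = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(f"supervised: {supervised_bits_per_token:.1f} bits per token")   # ~15.6
print(f"RL:         {rl_bits_per_rollout:.5f} bits per whole rollout") # ~0.00034
```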

Hottest takes

"Bit of a nitpick, but I think his terminology is wrong" — scaredginger
"RL involves just 1 bit of learning for a rollout, rewarding success/failure" — bogtog
"kind of avoids the question of "information inefficient about what?"" — derbOac
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.