November 5, 2025
Powered by oops
Learning from Failure to Tackle Hard Problems
Can AI learn from flops? Hype, hope, and eye-rolls
TLDR: Researchers propose BaNEL, training AI by learning only from wrong answers when right ones are rare and costly. Commenters split between calling it buzzword hype, asking if it’s just penalizing bad outputs, and urging teams to engineer cheap, frequent, visible failures so the idea actually works.
A new research push called BaNEL (think: teaching AI by only telling it what’s wrong) promises to tackle super-hard tasks where good answers are rare and expensive to check. The pitch: when success is near zero, learn from the mountain of failures and stop wasting money on reward checks.

Cue the comment-section chaos. One camp loves the grit, cheering the idea of “don’t make the same mistake twice.” Another camp rolls its eyes at mixing drug discovery and theorem proving in one breath, calling it grant-bait buzzword soup. Skeptics ask whether BaNEL is just a fancy way to “reduce bad outputs,” while pragmatists demand real-world knobs: make failures cheap, frequent, and visible, or you won’t learn a thing.

The memes? People joked BaNEL is like “AI raised by strict parents: only told what not to do,” and compared the slogan to relationship advice: “Do not make the same mistake twice.” Spicy, but fair. Whether you’re rooting for BaNEL or calling it “random search in a tux,” the thread is pure drama: hope vs. hype, theory vs. practice, and a lot of folks asking if clever math can beat a near-zero success rate without burning cash.
Key Points
- The article introduces BaNEL, an algorithm that post-trains generative models using only negative rewards to handle extreme reward sparsity while minimizing reward evaluations.
- It identifies two core challenges for hard problems: near-zero success probabilities for base models and costly/risky reward evaluations.
- Standard post-training methods like policy gradients (including GRPO) yield zero gradients when every sampled reward is zero, reducing learning to random search (a short sketch after this list shows why).
- Novelty-bonus methods (count-based exploration, random network distillation) can learn under sparsity but require many reward evaluations and underperform.
- The authors argue that learning from failures alone is feasible and necessary for tasks where positive examples are not encountered during training (the second sketch below gives the basic intuition).
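
Why does the third bullet hold? Under GRPO-style group normalization the advantage is the reward minus the group mean, divided by the group standard deviation; if every completion in the group scores zero, every advantage is zero and the policy-gradient surrogate (advantage times log-probability) vanishes, taking its gradient with it. Here is a minimal NumPy sketch of that arithmetic; the group size of 8 and the random stand-in log-probabilities are illustrative, not from the paper:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO-style group normalization: (r - group mean) / (group std + eps)
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hard-problem regime: the verifier rejects every completion in the group.
rewards = [0.0] * 8
advantages = grpo_advantages(rewards)
print(advantages)  # all zeros

# Stand-in for per-sample log p_theta(x); the actual values don't matter here.
log_probs = np.random.default_rng(0).normal(size=8)

# REINFORCE/GRPO surrogate: sum(advantage * log-prob). With all-zero
# advantages it is identically zero, so its gradient with respect to the
# policy parameters is zero as well; the update carries no learning signal.
surrogate = (advantages * log_probs).sum()
print(surrogate)  # 0.0
```

Sampling another batch after a zero-information update is exactly the “random search” the bullet warns about.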
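
The last bullet’s claim that failures alone carry usable signal can be illustrated with a deliberately crude toy that is not BaNEL’s algorithm: a categorical “generator” over candidate answers whose failed samples get pushed down, so renormalization slowly concentrates mass on the one answer that would pass the verifier, without a single positive reward ever being observed. Every name and number below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": a categorical distribution over 10 candidate answers.
# Exactly one of them (index 7) would pass the expensive verifier, but the
# loop below never uses that fact as a positive reward.
logits = np.zeros(10)
correct = 7

def probs(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

lr = 0.5
for step in range(300):
    p = probs(logits)
    i = rng.choice(len(p), p=p)
    if i == correct:
        continue      # successes are simply never used as a training signal
    logits[i] -= lr   # lower the failed candidate; renormalization shifts
                      # probability mass toward everything else

print(f"mass on the correct candidate: {probs(logits)[correct]:.3f}")
```

The toy needs hundreds of cheap draws to get anywhere, which is precisely the pragmatists’ point: the approach only pays off if failures are cheap, frequent, and visible.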