February 6, 2026
A+ student, F- in real life
Learning from context is harder than we thought
AI aces exams but flunks 'read the manual', and the comments are on fire
TLDR: Tencent’s CL-bench tests whether AI can learn from new info on the fly, not just recall old training. Commenters split: some want coached specialization, others call it impossible or risky to obey “untrusted” user rules—highlighting the gap between benchmark brilliance and messy real-world work.
Tencent dropped a new benchmark, CL-bench, to see if AI can actually learn from what’s right in front of it—like a 70,000-word drone manual, a full game rulebook, or messy lab logs—rather than just reciting what it memorized. Cue the comment section cage match. One camp cheered the “train-while-you-work” vibe: bradfa imagined logging the AI’s attempts, coaching it with human feedback, and ending up with a specialty model that improves like a person. The other camp smashed the optimism button with a hammer: johnsmith1840 declared the problem flat-out impossible today, turning “continual learning” (updating a model on the fly as it works) into the villain of the thread.
Then came the plot twist: rishabhaiover asked whether in‑context learning—models learning from the prompt—wasn’t already a thing, sparking eye-rolls and throwbacks to last year’s “emergent behavior” hype. Meanwhile, joriJordan went full philosophy, arguing there’s no one true context, so testing “correctness” is a kind of cosmic comedy. The most practical paranoia? XenophileJKO warned that CL-bench puts rules in user messages—aka untrusted context—and worried that training models to obey those blindly could be dangerous.
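For readers squinting at the jargon, here is a minimal sketch of what "in-context learning" and "untrusted context" look like in practice. The helper name and message layout below are illustrative (following the common system/user chat-message convention), not anything taken from CL-bench itself:

```python
def build_messages(trusted_rules: str, context: str, question: str) -> list[dict]:
    """Assemble a chat prompt where the model must learn from the supplied
    context (in-context learning) rather than recall its training data.

    Trusted, developer-set rules go in the system message. The fresh
    material (manual excerpt, rulebook, logs) and the question travel in
    the user message -- the "untrusted context" slot that XenophileJKO
    worries models could be trained to obey blindly.
    """
    return [
        {"role": "system", "content": trusted_rules},
        {
            "role": "user",
            "content": (
                "Answer using ONLY the reference material below. "
                "If it does not contain the answer, say so.\n\n"
                f"--- Reference material ---\n{context}\n\n"
                f"--- Question ---\n{question}"
            ),
        },
    ]

# Example: a one-line stand-in for a 70,000-word drone manual.
messages = build_messages(
    trusted_rules="You are a careful assistant. Never invent facts.",
    context="Rule 7: A drone may not ascend above 120 m in zone B.",
    question="What is the altitude cap in zone B?",
)
```

The benchmark's question, in these terms, is whether the model actually answers from the reference material in the user turn, or falls back on whatever its weights already "know."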
The memes wrote themselves: “A+ test-taker, F- in real life.” Someone joked the benchmark is basically “RTFM, but the model replies ‘TL;DR’.” Love it or loathe it, this matters: real work is chaotic, and if AI can’t learn from fresh info, it’s just a great student who can’t do the job.
Key Points
- The article argues that current language models rely on parametric knowledge and struggle to learn from context provided at inference time.
- It presents three real-world scenarios requiring context learning: parsing a 70K-word SDK, learning a game from a 15K-word rulebook, and analyzing 300 experiment logs.
- The authors identify a structural mismatch between benchmark-optimized “test-taker” models and user needs for context-adaptive systems.
- They call for a fundamental shift in optimization direction to build models that learn effectively from immediate context.
- The article links to CL-bench resources: project page, paper, code repository, and dataset to support evaluation and development of context learning.