February 6, 2026
A+ student, F- in real life
Learning from context is harder than we thought
AI aces exams but flunks 'read the manual', and the comments are on fire
TLDR: Tencent’s CL-bench tests whether AI can learn from new info on the fly, not just recall old training. Commenters split: some want coached specialization, others call it impossible or risky to obey “untrusted” user rules—highlighting the gap between benchmark brilliance and messy real-world work.
Tencent dropped a new benchmark, CL-bench, to see if AI can actually learn from what’s right in front of it—like a 70,000-word drone manual, a full game rulebook, or messy lab logs—rather than just reciting what it memorized. Cue the comment section cage match. One camp cheered the “train-while-you-work” vibe: bradfa imagined logging the AI’s attempts, coaching it with human feedback, and ending up with a specialty model that improves like a person. The other camp smashed the optimism button with a hammer: johnsmith1840 declared the problem flat-out impossible today, turning “continual learning” (updating a model on the fly as it works) into the villain of the thread.
Then came the plot twist: rishabhaiover asked whether in‑context learning—models learning from the prompt—wasn’t already a thing, sparking eye-rolls and throwbacks to last year’s “emergent behavior” hype. Meanwhile, joriJordan went full philosophy, arguing there’s no one true context, so testing “correctness” is a kind of cosmic comedy. The most practical paranoia? XenophileJKO warned that CL-bench puts rules in user messages—aka untrusted context—and worried that training models to obey those blindly could be dangerous.
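For readers squinting at the jargon, here is a minimal sketch of what "in-context learning" and "untrusted context" look like in practice. The helper name and message layout below are illustrative (following the common system/user chat-message convention), not anything taken from CL-bench itself:

```python
def build_messages(trusted_rules: str, context: str, question: str) -> list[dict]:
    """Assemble a chat prompt where the model must learn from the supplied
    context (in-context learning) rather than recall its training data.

    Trusted, developer-set rules go in the system message. The fresh
    material (manual excerpt, rulebook, logs) and the question travel in
    the user message -- the "untrusted context" slot that XenophileJKO
    worries models could be trained to obey blindly.
    """
    return [
        {"role": "system", "content": trusted_rules},
        {
            "role": "user",
            "content": (
                "Answer using ONLY the reference material below. "
                "If it does not contain the answer, say so.\n\n"
                f"--- Reference material ---\n{context}\n\n"
                f"--- Question ---\n{question}"
            ),
        },
    ]

# Example: a one-line stand-in for a 70,000-word drone manual.
messages = build_messages(
    trusted_rules="You are a careful assistant. Never invent facts.",
    context="Rule 7: A drone may not ascend above 120 m in zone B.",
    question="What is the altitude cap in zone B?",
)
```

The benchmark's question, in these terms, is whether the model actually answers from the reference material in the user turn, or falls back on whatever its weights already "know."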
The memes wrote themselves: “A+ test-taker, F- in real life.” Someone joked the benchmark is basically “RTFM, but the model replies ‘TL;DR’.” Love it or loathe it, this matters: real work is chaotic, and if AI can’t learn from fresh info, it’s just a great student who can’t do the job.
Key Points
- The article argues that current language models rely on parametric knowledge and struggle to learn from context provided at inference time.
- It presents three real-world scenarios requiring context learning: parsing a 70K-word SDK, learning a game from a 15K-word rulebook, and analyzing 300 experiment logs.
- The authors identify a structural mismatch between benchmark-optimized “test-taker” models and user needs for context-adaptive systems.
- They call for a fundamental shift in optimization direction to build models that learn effectively from immediate context.
- The article links to CL-bench resources: project page, paper, code repository, and dataset to support evaluation and development of context learning.