Study: Self-generated Agent Skills are useless

Self‑taught bot skills flop; human cheat sheets crush

TLDR: New tests show human-written tips lift AI assistants’ success by an average of 16 percentage points, while self-made “skills” add no average benefit. Commenters split: some roast AI-on-AI stacking as a failure spiral; others say skills work when grounded in real tool docs and fresh info. Either way, it’s a wake‑up call for auto‑learning hype.

Study drop: teaching chatbots to write their own “how‑to” notes didn’t help them on average, while hand‑crafted cheat sheets boosted success by an average of 16 percentage points, especially in healthcare. Bite‑sized tips (just 2–3 steps) beat wall‑of‑text manuals, and even smaller bots with good notes kept up with bigger ones. The crowd? Absolutely buzzing. One camp says this proves a long‑running meme: “skill issue,” but make it science. As one user put it, stacking AI on AI is just asking for chaos: “stop piping AI into AI.”

But the thread isn’t all I‑told‑you‑so. A pragmatic crew argues self‑written skills can work if they capture how to use real tools: think tiny guides for command lines and APIs (see the sketch below). “I treat them like mini CLAUDE.mds,” says one commenter, echoing a growing DIY vibe. Others defend the framework as a memory system, saying the real magic is a good “review your own work” prompt plus saving the lessons: old‑school prompt engineering making a comeback. Meanwhile, a careful voice notes the study’s setup isn’t what most people mean by “AI generating skills”; writing notes from the same brain isn’t the same as learning new facts. The spicy closer: without fresh info, self‑taught skills are just a chatbot writing Post‑Its to itself. The discussion delivered peak drama, peak memes, and a rare reality check.
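To make the pragmatists’ point concrete, here’s what such a focused skill might look like, in the “mini CLAUDE.md” spirit. This is a hypothetical example, not one from the study: the task, table name, and wording are invented, though the psql commands themselves are standard:

```markdown
# Skill: Export a Postgres table to CSV

Use when the user asks to dump a table for spreadsheet analysis.

1. Confirm the table exists: `psql -c "\dt orders"`.
2. Export with headers: `psql -c "\copy orders TO 'orders.csv' CSV HEADER"`.
3. Sanity-check: `wc -l orders.csv` should equal the row count plus one header line.
```

Note the shape: a one-line trigger, two or three verifiable steps, and nothing else, which lines up with the study’s finding that focused skills beat comprehensive documentation.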

Key Points

  • SkillsBench benchmark evaluates Agent Skills across 86 tasks in 11 domains using deterministic verifiers (see the sketch after this list).
  • Three conditions tested: no skills, curated skills, and self-generated skills.
  • Curated skills improve average pass rate by 16.2 percentage points, with wide domain variance.
  • Self-generated skills show no average benefit to performance.
  • Focused skills (2–3 modules) outperform comprehensive documentation; smaller models with skills can match larger models without skills.
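
For readers wondering what a “deterministic verifier” looks like, here is a minimal sketch in Python. It assumes nothing about SkillsBench’s actual code: the `Task`, `run_agent`, and `pass_rate` names are invented, and `run_agent` is a placeholder where a real model call would go. The point is that each task’s checker is a pure function, so the same output always scores the same way, and one harness covers all three conditions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    # Deterministic verifier: a pure function over the agent's final
    # output; same input always yields the same pass/fail, no LLM judge.
    verify: Callable[[str], bool]

# Hypothetical task: the checker tests a fixed property of the output.
tasks = [
    Task(
        prompt="Convert 2.5 hours to minutes; answer with the number only.",
        verify=lambda out: out.strip() == "150",
    ),
]

def run_agent(prompt: str, skills: str | None) -> str:
    """Stand-in for a real model call; skills (if any) are prepended
    to the context, mirroring the three experimental conditions."""
    context = (skills + "\n" if skills else "") + prompt
    return "150"  # placeholder answer so the sketch runs end to end

def pass_rate(skills: str | None) -> float:
    results = [task.verify(run_agent(task.prompt, skills)) for task in tasks]
    return sum(results) / len(results)

# The study's three conditions: no skills, curated, self-generated.
conditions = {
    "no_skills": None,
    "curated": "Tip: 1 hour = 60 minutes; multiply, don't divide.",
    "self_generated": "Note to self: handle time conversions carefully.",
}
for name, skills in conditions.items():
    print(f"{name}: {pass_rate(skills):.0%}")
```

Deterministic checking is what makes the 16.2-point delta easy to trust: rerunning the suite can’t change how an answer is graded.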

Hottest takes

“the more layers you automate with LLMs, the worse each successive layer gets” — embedding-shape
“I treat them like mini CLAUDE.mds” — turnsout
“you’re just piping output to input without expanding the information in the system” — smcleod