April 30, 2026
Creativity or just ad-core chaos?
The Human Creativity Benchmark – Evaluating Generative AI in Creative Work
AI judged on “creativity” and the comments instantly called it a design buzzword war
TLDR: Contra Labs says creative AI should be judged by both shared quality rules and personal taste, and current models don’t consistently handle both. Commenters were far less impressed, arguing the paper inflates ad-image design into “human creativity” and mostly shows AI can copy familiar visual norms.
A new Contra Labs study tried to do something very human: admit that in creative work, people don’t always agree. The paper says there are two different things happening when pros judge AI-made designs. Sometimes experts mostly agree on basics like readability and layout. Other times they split because it’s no longer about right or wrong — it’s about taste. That’s the big claim: today’s AI can sometimes make something competent, but it still struggles to be both reliably correct and meaningfully steerable toward a specific vibe.
But the real fireworks were in the community reaction, where readers immediately came for the name. One of the loudest complaints? This isn’t really a benchmark for all of human creativity at all — it’s mostly about ad-style visuals and product shots. Commenters basically yelled, “So… not poetry, not music, not painting — marketing assets?” That sparked a mini identity crisis over whether the paper is overselling a design test as a grand theory of creativity.
Then came the sarcasm. One popular dunk compared it to a "Turing test for Design," with critics saying the study mainly proves AI can imitate the visual rules humans already use. In other words: congrats, the machine made something that looks professionally safe. The joke hanging over the thread was brutal but funny: AI didn't unlock genius, it unlocked competent brand graphics. And honestly, that roast may have been more memorable than the benchmark itself.
Key Points
- Contra Labs introduced the Human Creativity Benchmark in April 2026 as a framework for evaluating generative AI in creative work.
- The framework separates evaluator agreement on shared best practices from evaluator disagreement driven by taste, aesthetic direction, and creative intent.
- The article argues that standard evaluation methods such as majority voting and adjudication are poorly suited to creative tasks because they suppress meaningful disagreement (a sketch below makes this concrete).
- Contra Labs says generative models tend to produce safe, averaged outputs under the same brief, which limits their usefulness in professional creative workflows.
- The benchmark treats evaluation dimensions as ranging from objective to subjective, with examples including prompt adherence, usability, and visual appeal.
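To make the majority-voting complaint concrete, here is a minimal Python sketch. It is not from the paper; the ratings, dimension names, and pass/fail labels are made-up placeholders. It shows how collapsing evaluator votes into one majority label reports the same kind of single verdict for a near-consensus dimension and a taste-split dimension, hiding exactly the disagreement the benchmark says matters.

```python
from collections import Counter

# Hypothetical evaluator verdicts for one AI-generated design, on two dimensions.
# "readability" stands in for a mostly objective dimension; "visual_appeal" for a taste-driven one.
ratings = {
    "readability":   ["pass", "pass", "pass", "pass", "fail"],
    "visual_appeal": ["pass", "fail", "pass", "fail", "fail"],
}

def majority_vote(votes):
    """Collapse all votes into a single label -- the standard aggregation step."""
    return Counter(votes).most_common(1)[0][0]

def disagreement(votes):
    """Share of evaluators who dissent from the majority label (0.0 = full consensus)."""
    counts = Counter(votes)
    return 1 - counts.most_common(1)[0][1] / len(votes)

for dim, votes in ratings.items():
    print(f"{dim:14} majority={majority_vote(votes):4}  disagreement={disagreement(votes):.2f}")

# Output:
# readability    majority=pass  disagreement=0.20
# visual_appeal  majority=fail  disagreement=0.40
#
# Both dimensions get a tidy single verdict, even though the visual_appeal split
# reflects taste rather than error -- the signal the paper argues should be kept,
# not averaged away.
```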