Silicon & Photonics | 27 September 2025 | 10 min read

The Benchmark Illusion

Why the best model isn't always the one that scores highest


Sajad Saleem

the mediocre generalist

When I was sixteen, I scored 94% on a mock A-level physics exam. Highest in the class. My teacher, Mr Henderson — a wiry man with a permanent expression of mild disappointment, as though the universe had let him down in some specific but unnameable way — pulled me aside afterwards and said: "Good mark. Now tell me what a magnetic field actually is."

I couldn't. Not really. I could calculate the force on a charged particle moving through one. I could draw the field lines with my eyes closed, probably in my sleep, almost certainly while eating a sandwich. But I couldn't tell you, in plain language, what a magnetic field is. What it means physically, intuitively, in the world you can touch.

94%. Understood roughly 40% of the subject. The benchmark said genius. Reality said: competent pattern-matcher with good exam technique and a decent memory.

I think about Mr Henderson a lot these days. The AI industry has a benchmark problem, and it looks almost exactly like my physics exam.

The tyranny of the leaderboard

If you've spent any time in the AI space — and by "any time" I mean roughly fifteen minutes on Twitter (I refuse to call it X; this is a matter of principle and I will die on this particular hill) — you've encountered The Leaderboard. MMLU. HellaSwag. ARC. HumanEval. GSM8K. GPQA. A forest of acronyms, each representing some standardised test that models are measured against, ranked on, and marketed by.

Model X scores 92.4% on MMLU! Model Y achieves state-of-the-art on HellaSwag! Model Z breaks the record on GPQA Diamond!

Numbers go up. They always go up. Every quarter, new models arrive with higher scores, press releases are issued, Twitter threads appear with fire emojis and rocket ships, and everyone gets very excited about a number that tells you surprisingly little about whether the model will actually be useful for the thing you need it to do. It's like choosing a life partner based on their SAT scores. Technically informative. Practically useless.

Benchmarks measure what benchmarks measure. This sounds tautological because it is. The mistake — pervasive, industry-wide, persistent — is assuming that what benchmarks measure is the same thing as what matters.

What MMLU actually tells you (and what it doesn't)

MMLU — Massive Multitask Language Understanding — is perhaps the most cited benchmark in the field. Multiple-choice questions across 57 subjects, from abstract algebra to world religions.
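
For the avoidance of mystery, here's roughly what that scoring looks like under the hood — a minimal sketch with a toy question and a stand-in `model` function, not any lab's actual evaluation harness:

```python
from typing import Callable

def score_multiple_choice(
    questions: list[dict],
    model: Callable[[str], str],
) -> float:
    """Exact-match accuracy over A/B/C/D answers."""
    correct = 0
    for q in questions:
        # Format the question and its options into a single prompt.
        options = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"])
        )
        prompt = f"{q['question']}\n{options}\nAnswer:"
        # Keep only the first character of the model's reply as its predicted letter.
        prediction = model(prompt).strip().upper()[:1]
        correct += prediction == q["answer"]
    return correct / len(questions)

# Hypothetical usage: one toy question and a "model" that always answers B.
sample = [{
    "question": "Which particle carries negative charge?",
    "choices": ["Proton", "Electron", "Neutron", "Photon"],
    "answer": "B",
}]
print(score_multiple_choice(sample, lambda prompt: "B"))  # 1.0
```

That's the whole game: a prompt, a letter, a tally.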

If a model scores high on MMLU, you can be fairly confident it has broad knowledge and can reason through multiple-choice questions about that knowledge.

But when was the last time you used an AI model to answer multiple-choice questions?

I use AI models every day. Writing code, brainstorming architecture, debugging problems, drafting communications, thinking through complex decisions, exploring ideas I don't fully understand yet. Not once — not a single time — has any of those tasks resembled a multiple-choice exam. Life doesn't come with a mark scheme.

What I actually need AI for is open-ended. Messy. Ambiguous requirements, shifting constraints, no clearly correct answer. I need the model to ask clarifying questions, make judgment calls, balance competing priorities, know when good enough is good enough and when perfection is worth the extra hour. I need everything a multiple-choice test is specifically designed to avoid testing.

MMLU measures none of this. Can't, by design. And yet MMLU scores are treated as a proxy for general intelligence in a way that would make Mr Henderson shake his head with that familiar, weary disappointment. Benchmarks aren't useless — they're a necessary sanity check, a way to ensure baseline competence. But they've become the metric rather than a metric, and the distance between those two positions is where all the mischief lives.

The HellaSwag problem, or: why common sense is hard to test

HellaSwag tests commonsense reasoning. Give the model a scenario, ask it to predict what happens next from a set of options. Models now score above 95%, which sounds impressive until you realise that scoring 95% on HellaSwag doesn't mean the model has common sense.

It means the model is good at HellaSwag. Full stop.

Common sense — actual, real-world, I-wouldn't-stick-my-hand-in-that common sense — is contextual, embodied, experiential. It's knowing that "I could eat a horse" doesn't require a trip to the butcher's. It's understanding that "first thing Monday" means something different in London than in San Francisco. It's intuiting that when your colleague writes "interesting approach" in a code review, what they actually mean is "I hate this but I'm being polite and I'd like you to delete it before anyone else sees it."

No benchmark tests for the ability to agree to disagree over coffee. Perhaps one should.

The rise of vibes-based evaluation

Here's my contrarian claim, and I'll state it plainly: vibes-based evaluation isn't a fallback for when you can't measure properly. It's the only evaluation that matters for real work. Benchmarks are a cope for people who can't trust their own judgment.

I understand why that's an uncomfortable position. We like numbers. Numbers feel rigorous. Defensible. You can put them in a slide deck and nobody argues. But rigour and numbers are not the same thing, and confusing them is how you end up optimising for the wrong target with exquisite precision.

When you sit with a model for four hours on a real project — not a benchmark, not a test, but actual work with actual stakes — you develop a feeling for how good it is. You notice things no benchmark captures. How it handles ambiguity. Whether it asks good questions or ploughs ahead with assumptions. Whether its code is merely correct or actually elegant. Whether working with it feels like collaboration or dentistry.

A model's "vibe" is an aggregate signal integrating hundreds of micro-observations your conscious mind can't individually articulate. It's not unscientific — it's pre-scientific. Pattern recognition that precedes formal measurement. And it's often more accurate than the formal measurement that follows. Your nervous system is running a benchmark with a million parameters, weighted by everything you've ever needed a model to do. No leaderboard replicates that. No leaderboard can.

I've used models that score within a percentage point of each other on every major benchmark and feel completely different to work with. One is precise but sterile — correct answers delivered with the warmth of a tax return. The other is equally precise but alive. It engages with your problem, pushes back when your approach is flawed, offers alternatives you hadn't considered, and occasionally makes an observation so sharp you have to put your chai down and think for a moment.

Benchmarks say equivalent. Vibes say different leagues. I trust the vibes. Every time. And I think the people who dismiss vibes-based evaluation are mostly people who've never sat with a model long enough to develop any.

The Goodhart trap

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Originally about monetary policy. Applies to AI benchmarks with almost eerie precision.

Once MMLU became the metric everyone optimised for, it became less useful as a metric. Not because models got worse — they got better! At MMLU. Specifically at MMLU. Labs started training in ways that boosted benchmark scores, sometimes consciously, sometimes through the gravitational pull of incentives that nobody explicitly designed but everyone obeyed. Data contamination crept in. The industry's incentive structure — where funding, press coverage, and customer acquisition all flow toward the highest number — created a pull toward benchmark optimisation that was nearly impossible to resist.
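
Contamination itself is mundane in mechanism: benchmark questions leak into web-scale training data, and the standard first-pass check is crude n-gram overlap. A minimal sketch, with made-up text and an illustrative 8-gram window rather than anyone's real decontamination pipeline:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-grams of whitespace tokens in a string, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    """Flag a benchmark item if any n-gram from it appears verbatim in a training document."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

# Illustrative check: a benchmark question that shows up word-for-word on a crawled page.
question = "What is the capital of France? A. Lyon B. Paris C. Nice D. Marseille"
crawled_page = (
    "trivia night answers: what is the capital of france? "
    "a. lyon b. paris c. nice d. marseille"
)
print(looks_contaminated(question, crawled_page))  # True
```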

The map became more important than the territory. We're navigating by a chart that looks detailed and precise, and we're so busy admiring it that we've stopped looking out the window. Occasionally we crash into things. We blame the things.

I'm not accusing anyone of fraud. The problem is structural, not moral. When your entire ecosystem rewards a specific number, everything bends toward that number. That's not conspiracy. That's just incentive physics. Gravity doesn't need a motive.

Key Insight

The Goodhart trap is particularly vicious in AI because the stakes are high and the feedback loops are fast. A lab publishes a number. The press amplifies it. Investors reward it. Customers buy based on it. The lab optimises harder. The number goes up. The actual capability — the messy, hard-to-measure, useful-in-the-real-world capability — may or may not follow. Nobody checks. The number is the story. The territory is an afterthought.

The evaluation evolution

The field is evolving. Slowly, painfully, with much debate and occasional acrimony, but evolving.

Chatbot Arena introduced a radical idea: letting actual humans evaluate models by using them. Blind comparisons. Real tasks. Elo ratings derived from preference rather than performance on predetermined questions. Messy, subjective, statistically noisy — and much closer to what actually matters than any multiple-choice benchmark. The most useful measurement of a restaurant isn't its health inspection score. It's whether people come back.
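
If "Elo ratings derived from preference" sounds exotic, it isn't — it's the chess formula applied to blind votes. A minimal sketch with made-up votes; the Arena's actual leaderboard statistics are more involved than this single-update version:

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One standard Elo update from a single pairwise preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Made-up votes: each tuple is (winner, loser) from a blind side-by-side comparison.
votes = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], a_wins=True)
print(ratings)
```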

Then there are task-specific evaluations. SWE-bench for coding. GAIA for real-world assistant tasks. MATH for mathematical reasoning. These are better because they test models on things people actually use models for. When a model scores high on SWE-bench, that tells me something I care about: it can probably help me fix bugs and implement features in real codebases. That's a benchmark I can act on. One that pays rent.
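
The shape of that measurement is what makes it trustworthy: apply the model's patch, rerun the tests that were failing, see whether they pass. A toy sketch of that resolved-or-not check — the paths and test command below are placeholders, not SWE-bench's actual harness:

```python
import subprocess

def patch_resolves_issue(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch, then rerun the issue's failing tests."""
    # patch_file should be an absolute path or relative to repo_dir.
    apply = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch doesn't even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0  # resolved only if the tests now pass

# Placeholder usage — the repo, patch, and test selection are illustrative:
# resolved = patch_resolves_issue("/tmp/some-repo", "/tmp/model_patch.diff",
#                                 ["pytest", "tests/test_bug.py"])
```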

There's a growing movement toward holistic evaluation too — assessing models not on isolated tasks but on sustained, multi-step, real-world workflows. Can it help you build a feature over the course of an afternoon? Can it maintain coherence across a long conversation? Can it adapt when requirements change halfway through, which they always do, because requirements are written by humans and humans are beautifully, infuriatingly inconsistent?

Harder to measure. Harder to put on a leaderboard. These are the evaluations that actually matter.

The taste question

I've written elsewhere about taste — the quality that separates a model that's merely correct from one that's good to work with. Taste is the ultimate anti-benchmark quality. You can't test for it. Can't score it. Can barely define it. But you know it the instant you experience it, the way you know good prose from competent prose before you can articulate why.

A model with taste suggests the right abstraction, not just the right answer. It knows when a solution is over-engineered. It knows when brevity serves better than completeness. Correctness is the floor. Taste is the ceiling. Most models live on the floor and think they've arrived.

A model without taste gives you what you asked for. A model with taste gives you what you meant. The gap between those two things is where all the value lives, and no multiple-choice test on earth can measure it. Benchmarks test whether a model can parse what you asked for and return the right answer. They cannot test whether it understands what you meant. And understanding what someone means is the entire job. This is why the best engineers aren't the ones who implement specs faithfully — they're the ones who push back on specs that would faithfully implement the wrong thing.

The path forward, or: in defence of informed subjectivity

So where does this leave us? Benchmarks that are useful but insufficient. Vibes that are valid but unscalable. A gap between measurement and reality that the industry is slowly, grudgingly learning to acknowledge.

The path forward involves several things, none of them easy:

First, better benchmarks. Not more benchmarks — God knows we have enough. Better ones. Benchmarks that test models on real tasks, in real contexts, with real ambiguity. SWE-bench is a start. We need equivalents for every major use case, designed by people who actually do the work, not by people who study the people who do the work.

Second, honest reporting. Labs should publish benchmark results alongside real-world performance data, including failure cases. Tell me where your model struggles, not just where it shines. A model that admits its weaknesses earns more trust than one that only shows its strengths. This is true of people too, incidentally. And of dating profiles.

Third — and this is the one the industry will find hardest to swallow — legitimising subjective evaluation. Not as a replacement for quantitative measurement, but as its equal. Chatbot Arena is a start. We need structured frameworks for qualitative assessment that are rigorous without pretending to be objective. Because they're not objective. They can't be. And that's fine. The best decisions I've ever made had data in one hand and gut feeling in the other. The worst decisions I've ever made were the ones where I ignored the gut because the spreadsheet looked clean.
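
What a structured-but-honest framework might look like in practice: named dimensions, multiple human reviewers, scores recorded as judgments rather than laundered into false precision. A minimal sketch — the dimensions and scale are mine, not any standard:

```python
from dataclasses import dataclass, field
from statistics import mean

# Dimensions are illustrative; the point is that the subjectivity is named and recorded.
DIMENSIONS = ("handles_ambiguity", "asks_good_questions", "code_elegance", "pushes_back_when_wrong")

@dataclass
class QualitativeReview:
    reviewer: str
    task: str
    scores: dict[str, int]  # 1-5 per dimension, assigned by a human after real work
    notes: str = ""

@dataclass
class ModelReport:
    model: str
    reviews: list[QualitativeReview] = field(default_factory=list)

    def summary(self) -> dict[str, float]:
        """Average each dimension across reviewers -- an aggregate of judgments, not an objective score."""
        return {dim: mean(r.scores[dim] for r in self.reviews) for dim in DIMENSIONS}

# Hypothetical usage with two reviewers on the same model.
report = ModelReport("model_x", [
    QualitativeReview("alice", "refactor auth module",
                      dict(zip(DIMENSIONS, (4, 5, 3, 4))), "asked about edge cases unprompted"),
    QualitativeReview("bob", "debug flaky test",
                      dict(zip(DIMENSIONS, (3, 4, 4, 2)))),
])
print(report.summary())
```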


Mr Henderson retired a few years ago. I don't know if he's aware of AI's current moment. But I think he'd recognise the problem immediately. He spent his career teaching students that scoring well on a test and understanding the subject were two entirely different things, and that confusing them was the most dangerous mistake you could make.

We're making that mistake now. At scale. With technology that matters a great deal.

The benchmark illusion isn't about the benchmarks being wrong. It's about us forgetting what they're for. A benchmark is a torch. It illuminates a small area brightly. But the room is vast, and most of it is in shadow, and the things that matter most — taste, judgment, the ability to understand what someone means rather than what they said — live precisely where the torch doesn't reach.

The best model isn't the one that scores highest on tests designed to be easy to score. It's the one that, when you sit with it at midnight and work and think, makes you feel like you're not working alone.

The only benchmark for that is the one between your ears.

Not everything that counts can be counted, and not everything that can be counted counts.

— William Bruce Cameron