When high scores don’t mean high intelligence: how to build better benchmarks
Opinion: Researchers from the Oxford Internet Institute argue that lessons from the social sciences can help us better assess the capabilities of AI
By Andrew M. Bean, Franziska Sofia Hafner, Ryan Othniel Kearns and Adam Mahdi
If you like puzzle games, you might be familiar with the LinkedIn messages saying that you’re “smarter than 95% of CEOs.” The website’s games rank players by puzzle completion speed and compare their times to groups like “CEOs” and “others with your name”. Of course, racing someone at sudoku is a pretty unreliable way to determine who is smarter.
In AI research, we often make the same mistake, with far higher stakes. We point to benchmarks as evidence of the capabilities of LLMs without spending much time considering how we are making those measurements. Evaluations are central to AI development: they set the direction for research and tell us when we make progress. But current evaluations don’t tell us enough about what matters; benchmark scores keep improving while real-world applications remain much less effective.
In the social sciences, the degree to which a test provides evidence about the real-world phenomenon it is intended to measure is called “construct validity.” If the ability to complete a sudoku puzzle quickly does not actually reflect what we mean by ‘intelligence’, then the test has poor construct validity as a measure of intelligence. The psychologists who developed the concept of construct validity were worried about understanding when a human is “intelligent” or “extroverted,” concepts that are difficult to assess directly — much like the capabilities of modern AI.
Building on the language of construct validity gives us an idea of questions to ask when creating a benchmark. Are you worried about generalization to the real world? You can look into techniques for increasing ecological validity, so that an evaluation better reflects reality. Are you worried that the benchmark doesn’t exactly capture the right phenomenon? You can look into convergent validity, for example testing whether your benchmark aligns with other benchmarks of similar phenomena.
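As a rough illustration of what a convergent-validity check might look like in practice, the sketch below compares how the same set of models ranks on two benchmarks meant to capture similar phenomena. The model names and scores are hypothetical, and a rank correlation is only one weak piece of evidence, not a verdict.

```python
# Minimal sketch of a convergent-validity check (illustrative only).
# Assumes you already have per-model scores on two benchmarks intended to
# measure similar phenomena; the names and numbers below are hypothetical.
from scipy.stats import spearmanr

scores_new_benchmark = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
scores_existing_benchmark = {"model_a": 0.68, "model_b": 0.61, "model_c": 0.55}

models = sorted(scores_new_benchmark)
rho, p_value = spearmanr(
    [scores_new_benchmark[m] for m in models],
    [scores_existing_benchmark[m] for m in models],
)
# A high rank correlation is one (weak) piece of convergent evidence;
# a low one suggests the benchmarks may be capturing different things.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```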
Drawing on these concepts, we recently led a large collaboration of researchers conducting a systematic review of more than four hundred benchmarks published at top machine learning and natural language processing conferences. We found that the vast majority of these benchmarks had not followed best practices in at least one of the five key areas we reviewed. While this sounds terrible, and is certainly a problem, we are optimistic that it can be addressed. AI evaluations have gone through a dramatic shift in the past decade, from measuring clean, well-defined phenomena like accuracy in recognizing handwritten digits to abstract and often vague phenomena like “reasoning” and “harmlessness”. The problems facing evaluators have become dramatically more difficult, and the methods need to catch up.
Based on our literature review, we have eight concrete recommendations to increase the construct validity of evaluations.
Define the phenomenon
Starting with the basics, about half of papers didn’t provide a clear definition of what they were intending to measure. While terms like “harmlessness” may not have a single consensus understanding, having a definition at least lets future potential adopters of a benchmark know whether their use case is included. Combining a variety of precisely defined benchmarks strengthens the evidence for researchers hoping to make larger claims in the future.
Measure only the phenomenon
This is harder than it sounds. Most tasks, in benchmarks and applications, involve a combination of multiple capabilities. Instruction-following in particular can be a confounder here: if a model is bad at, for example, generating the JSON data formatting that is common in benchmarks, it could appear incapable of a wide range of other tasks. If the confounders are unavoidable, comparing to a baseline which only tests the confounding aspects of the task can help disentangle the different skills.
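One way to make such a baseline concrete is to score the confounding skill on its own, separately from task correctness. The sketch below does this for JSON compliance; the helper is illustrative and simply assumes outputs are expected to be valid JSON.

```python
# Hedged sketch: a formatting-only baseline that isolates the JSON-compliance
# confounder from the capability being measured. Inputs are hypothetical.
import json

def json_compliance_rate(model_outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON at all, regardless of correctness."""
    ok = 0
    for out in model_outputs:
        try:
            json.loads(out)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(model_outputs)

# If a model's benchmark score is low and its compliance rate is also low,
# the formatting confounder, not the target capability, may explain the gap.
print(json_compliance_rate(['{"answer": 4}', "the answer is 4", '{"answer": 7}']))
```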
Construct a representative dataset for the task
Representation is usually a trade-off between completeness and cost, requiring assumptions about which types of generalisation should be tested in the data, and which can be assumed. In the case of LLMs, it is worth testing along dimensions where humans are (mostly) robust and models are often not. For example, changing a math problem by adding more digits to a number or swapping out unimportant context in a prompt may impact LLMs even when humans would be largely unaffected.
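A minimal sketch of this kind of perturbation, assuming a simple arithmetic template (the wording and helpers are invented for illustration), might look like the following.

```python
# Illustrative sketch: generating perturbed variants of an arithmetic item
# to probe robustness along dimensions where humans are largely stable.
# The templates and helper names are hypothetical, not from the review.
import random

def perturb_digits(a: int, b: int, extra_digits: int = 2) -> tuple[int, int]:
    """Lengthen the operands without changing the kind of task being tested."""
    factor = 10 ** extra_digits
    return (a * factor + random.randint(0, factor - 1),
            b * factor + random.randint(0, factor - 1))

def make_item(a: int, b: int, context: str) -> str:
    return (f"{context} If one crate holds {a} apples and another holds {b}, "
            f"how many apples are there in total?")

contexts = [
    "A grocer is stocking shelves.",         # original framing
    "A warehouse robot is counting stock.",  # semantically irrelevant swap
]

a, b = 37, 48
for context in contexts:
    big_a, big_b = perturb_digits(a, b)
    print(make_item(big_a, big_b, context), "| answer:", big_a + big_b)
```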
Acknowledge limitations of reusing datasets
Dataset reuse cuts both ways. Iterating on existing datasets helps to improve them. However, it also reduces the space of tasks represented across benchmarks and increases the risks of dataset contamination, where test data becomes part of the training dataset for new models.
Prepare for contamination
Evaluation contamination is basically inevitable at this point. Even if no one deliberately trains on benchmark test sets, the sheer scale of pre-training datasets means that thorough cleaning is unlikely. Creating a test dataset that is not closely related to some part of the training data is difficult, but it is important.
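Where a sample of the pre-training corpus is available, checking for long shared n-grams is one common, if imperfect, way to flag likely contamination. The sketch below is a simplified illustration; the n-gram length and variable names are assumptions, not a prescription from our review.

```python
# Rough sketch of an n-gram overlap check for contamination, assuming you can
# stream (a sample of) the pre-training corpus; names here are hypothetical.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_items: list[str], training_docs, n: int = 13) -> set:
    """Return indices of test items sharing any long n-gram with the training sample."""
    test_grams = [ngrams(item, n) for item in test_items]
    flagged = set()
    for doc in training_docs:  # e.g. a streamed sample of pre-training text
        doc_grams = ngrams(doc, n)
        for i, grams in enumerate(test_grams):
            if grams & doc_grams:
                flagged.add(i)
    return flagged
```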
Use statistical methods to compare models
Surprisingly few current benchmarks actually use statistics to draw conclusions. We encourage conference reviewers to ask researchers to provide uncertainty estimates as part of any new benchmark, in order to promote the adoption of better practices.
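As one example of what such uncertainty estimates could look like, the sketch below runs a paired bootstrap over per-item scores for two models evaluated on the same benchmark items. The function and the toy scores are hypothetical, shown only to make the idea concrete.

```python
# A minimal sketch of a paired bootstrap comparison between two models scored
# on the same benchmark items (per-item 0/1 correctness arrays are assumed).
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Bootstrap the mean score difference and report a 95% interval."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    idx = rng.integers(0, len(scores_a), size=(n_resamples, len(scores_a)))
    diffs = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    return diffs.mean(), np.percentile(diffs, [2.5, 97.5])

# Hypothetical per-item results for two models on the same five-item benchmark.
mean_diff, ci = paired_bootstrap([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(f"mean difference = {mean_diff:.3f}, 95% CI = {ci}")
```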
Conduct an error analysis
Often the most informative part of a benchmark paper is an exploration of the failure cases. This can reveal promising paths for identifying biases in scoring as well as improving training.
Justify construct validity
Since benchmark creation involves balancing competing factors such as cost, representativeness and concept coverage, a concise explanation of the choices made improves the usefulness of the benchmark for future research. This can help users of a benchmark better understand its strengths and weaknesses, and determine when and where to use it.
As one of the academics who reviewed our study said, these recommendations are in some sense “well-known practices for writing a valid scientific paper” translated to AI evaluations. Sound evaluations are crucial to the continued development of our field, and construct validity is a key area for improvements.
Andrew Bean is the lead author of the recent review Measuring what Matters (NeurIPS 2025), a collaboration of more than forty researchers focused on improving construct validity in LLM benchmarks. He is a DPhil student at the Oxford Internet Institute, where he works on AI evaluations.
Ryan Othniel Kearns is a DPhil (PhD) student at the Oxford Internet Institute and a Cosmos Fellow. He researches evaluations of AI systems to discover and characterise the space of intelligent behaviour in AI.
Franziska Sofia Hafner is a DPhil student in Social Data Science at the Oxford Internet Institute. Her research focuses on algorithmic fairness, machine learning, and interactive data visualisation.
Adam Mahdi is an Associate Professor and Senior Research Fellow at the Oxford Internet Institute. He leads the Reasoning with Machines Lab, which focuses on the science of LLM evaluation, AI safety, agentic AI for science, and human–LLM interaction.



