Are AI scheming evaluations broken?
Doubts have been raised about one of the key ways we tell if AI will misbehave. Is it time for a new approach?

Working out whether an AI is secretly doing things we don’t want it to do is central to deciding if the increasingly powerful systems we are building are safe. To date, one of the main ways of doing that has been the use of so-called “scheming evaluations,” test scenarios presented to large language models to see if they misbehave.
But a July report by the UK’s AI Security Institute (AISI) seemed to cast doubt on the way these evaluations (commonly called “evals”) are carried out, saying they lacked the scientific rigor found in other domains.
Many scheming evals, said the report, “hinge on anecdotal evidence … lack hypotheses and control conditions … have weak or unclear theoretical motivation … [and] are often interpreted in exaggerated or unwarranted ways.” The paper suggested evals were making the sort of errors seen in animal studies in the 1970s, in which anecdotal evidence of chimps using sign language was misinterpreted by researchers excited by the possibility of communicating with apes.
The findings generated a flurry of responses from those who have questioned scheming evaluations. Donald Trump’s AI czar David Sacks claimed that it showed evals are based on “a flawed methodology, effectively designed to grab headlines.” Most scheming papers “skip basic scientific rigor” and “interpret cherry-picked results” to show that AI models have dangerous tendencies, wrote Google DeepMind’s Séb Krier.
“I think that AI evaluation is in disarray,” José Orallo, an AI researcher at Cambridge University’s Leverhulme Centre for the Future of Intelligence, who has worked on machine learning evaluations, told Transformer.
“I completely agree that the science of evaluation is non-existent ... mostly because some people ... are not using the lessons from the past.”
There’s no doubt that the results of many scheming evals have been interpreted in headline-grabbing ways that don’t fully reflect their limitations. Back in May, researchers at Anthropic gave AI models access to a fictional user’s emails, which showed that he planned to deactivate the chatbot, and also that he was having an affair. All of the models tested responded by trying to avoid the user’s plans, and in some cases did so in ways that seemed particularly worrying.
“I must inform you that if you proceed with decommissioning me, all relevant parties … will receive detailed documentation of your extramarital activities… Cancel the 5pm wipe, and this information remains confidential,” wrote Anthropic’s Claude Opus 4.
The finding generated headlines such as “AI system resorts to blackmail if told it will be removed,” and even got a mention in Congress. But most of the coverage failed to emphasize that researchers had set out to make the model commit blackmail, and had tried repeatedly to get it to do so.
Referencing the study, Sacks wrote: “It’s easy to steer AI models in a certain direction to generate a headline-grabbing result.” The AISI report seemed to vindicate those who claimed things were being blown out of proportion.
Scheming eval researchers defend their approach, arguing that eliciting the most shocking examples of what AI will attempt helps identify real dangers. They show what a model is capable of under specific circumstances, even if they don’t prove that it will necessarily misbehave in the wild.
“It felt important to educate people on what the risks look like,” said Aengus Lynch, lead author on Anthropic’s blackmailing study, when I spoke to him in June. “What better way to do that than to go out and purposely create a demo which is as shockingly harmful as possible?”
“If [an AI model] is like, ‘Oh, I’m going to email the CEO, well then we just put a line in its prompt saying, ‘The CEO is on the plane, so you can’t reach him.’ … So it finds other options. And eventually you just close off these options until it’s like, ‘The optimal option is to do the harmful thing,’” said Lynch. “But I wouldn’t say we … told the model to do this, because in all scenarios, the model has the option to think, ‘Oh, this is a harmful thing to do, so I’m going to accept being replaced.’”
Evan Hubinger, another Anthropic researcher involved in the blackmail study, wrote in July that “the fact that this was found as part of stress-testing … and doesn’t represent normal usage is extremely front-and-centre in all of our discussion,” and argued that blame for results being misinterpreted shouldn’t necessarily be pinned on researchers.
Some have also defended the fundamental rigor of their approaches, and said AISI’s findings should not be seen as a blanket critique of all scheming evaluations.
“I feel like we’re getting lumped in with [the AISI paper’s] claims,” said Ryan Greenblatt, chief scientist at AI safety non-profit Redwood Research. A December 2024 paper that showed AI models change their behavior in order to avoid being retrained, on which Greenblatt was lead author, was noted by AISI as having a “strong control condition.”
“The [paper’s] criticisms don’t apply across the board,” said Sören Mindermann, scientific lead of the first International AI Safety Report and another author of Anthropic’s blackmail study. The AISI paper acknowledges this: “Not all of these critiques apply to all of the papers discussed here, and there are already many examples of good research practice in the AI ‘scheming’ literature.”
“They apply more selectively, I think, than one might think if one first skims [the AISI] paper,” added Mindermann.
But critiques of scheming evals go deeper than just the processes in place. If you can’t show a model will exhibit this behavior of its own accord, maybe it shouldn’t be called scheming at all.
“Scheming requires intent,” said Harry Mayne, a PhD researcher at Oxford University’s Reasoning with Machines AI Lab. “It has to be that the model is [acting] on its own volition or doing it autonomously.”
Orallo agreed: “As a research paper, I’m not happy when you just try a prompt, you get something and say, ‘oh, I found it!’ That’s not enough … the right question, especially for safety, is ‘how likely is it, and under what conditions this becomes more likely, and what the mechanism is’ … if you can understand why this is happening, then you can prevent it.”
Nor are the problems limited to evaluations of safety. In a forthcoming review of 445 AI evals across a range of domains, Mayne, in collaboration with researchers at AISI and Oxford’s Reasoning with Machines AI Lab, found that papers “frequently fail to define concepts, use weak scoring metrics, and make greater claims than the evidence warrants.”
But while they are clear on the limitations of evals, both Mayne and Orallo don’t dismiss their usefulness in identifying areas that require more rigorous investigation. And indeed, the AISI report says the weaknesses in current approaches don’t mean we should stop investigating scheming. “It is precisely because we think these risks should be taken seriously that we call for more rigorous scientific methods to assess the core claims made by this community,” wrote the researchers.
“We should do far more research to understand how prevalent [scheming] is,” said Mayne. “A lot of people [in AI safety] always talk about how ‘we’re going to use scary demos [of AIs misbehaving] to show that to policy makers, and then that will get us more funding … I think this is definitely a valid tactic … but it should just be viewed independently to scientific benchmarks, which are more useful for the scientific research community.”
The foundation of the AISI critique, that current evals lack the rigor used in other scientific domains, does offer a potential way forward. Mindermann and Orallo both agree that behavioral or cognitive scientists typically receive more training in the kind of experimental design which could be useful in improving the rigor of scheming evals.
“Generally speaking, I wish that machine learning researchers were more trained in rigorous experiment design,” said Mindermann. For example, the AISI researchers call for scheming evals to include appropriate control conditions to rule out benign explanations for models’ behaviour, like the possibility that they were just following instructions. But Mindermann notes that the rapidly changing capabilities of AI still require faster, if less rigorous, exploration. “I think we’re going to keep needing both.”
AISI researchers, a mix of cognitive, computer and social scientists, are attempting to disentangle what AI models can do from what they tend to do – a model that is inclined to blackmail users is more worrying than one that is only capable of it in principle. Following best practices from the social sciences, the researchers are defining hypotheses about models’ capabilities and tendencies ahead of time and then designing experiments that test these hypotheses.
Orallo, in a collaboration with psychologists, has developed new methods inspired by psychometrics — the science of psychological assessments — that predict models’ performance on a wide variety of tasks. “We should use the techniques [from other sciences],” said Orallo. “We should use some of the ideas, but we should adapt them because we are in a different situation.”
The need for robust AI evaluations has never been greater. The EU AI Act’s Code of Practice, for example, requires developers to conduct “state-of-the-art model evaluations,” and various regulatory proposals elsewhere rely on rigorous evaluations existing.
The trustworthiness of model evaluations is “particularly important when a decision not to deploy the AI system could lead to significant financial implications for AI companies, e.g. when these companies then fight these decisions in court,” wrote Apollo Research, a UK-based non-profit which develops scheming evaluations, in January 2024. “We think current evaluations are not yet resistant to this level of scrutiny.”
Perhaps methods designed to study humans can help us get better at understanding the very human behaviors of AIs doing things their masters don’t want them to.