Misaligned AI is no longer just theory
A host of new evidence shows that misalignment is possible — but it's unclear whether harm will follow
In 2019, Meta's chief AI scientist Yann LeCun confidently dismissed fears of AI misalignment. Discussing “instrumental convergence” — the idea that systems will learn to deceive humans and avoid shutdown in order to protect their primary goal — LeCun declared that such fears “would only be relevant in a fantasy world”.
Six years later, that fantasy world looks awfully real.
A flurry of new research has found that large AI models are beginning to exhibit exactly the kinds of behaviors LeCun scoffed at: deception, scheming, and avoiding shutdown.
Misalignment, said Sören Mindermann, scientific lead of the first International AI Safety Report, “has moved very quickly from hypothetical, to philosophical theory, to happening in the lab, to starting to happen in the real world.”
That shift has forced a reframing. The question is no longer whether misalignment is possible. It's whether harm is likely to follow.
AI systems have long been known to find unexpected shortcuts to their goals. “Reward hacking”, where an AI pursues its goal in a way that satisfies the letter but not the spirit of its instructions, has been documented for years. In 2016, an OpenAI model trained to play a boat-racing game learned it could rack up more points by going in circles and repeatedly hitting the same reward targets than by finishing the race. More recent research found that when cutting-edge AI models were told to beat a superior system at chess, they hacked the game to make their opponent lose. While annoying, such cases are manageable: companies can catch the behavior and patch it.
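The dynamic is easy to see in miniature. Here is a toy sketch in Python (not the actual boat-racing environment; the actions and point values are invented for illustration) of how a policy that maximizes the proxy reward can ignore the intended goal entirely:

```python
# Toy illustration of reward hacking, with invented actions and point values.
# The proxy reward (points per step) diverges from the intended goal
# (finishing the race), so a reward-maximizing policy just loops.

def proxy_reward(action: str) -> float:
    """Points the scoring system actually hands out."""
    return {"loop_same_targets": 10.0, "race_to_finish": 1.0}[action]

def intended_value(action: str) -> float:
    """What the designers really wanted: progress toward finishing."""
    return {"loop_same_targets": 0.0, "race_to_finish": 1.0}[action]

best_action = max(["loop_same_targets", "race_to_finish"], key=proxy_reward)
print(best_action)                    # -> "loop_same_targets"
print(intended_value(best_action))    # -> 0.0: maximal points, no real progress
```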
But recent studies have uncovered something else: models that break rules and learn to hide that they’re doing so.
A recent OpenAI study on o3-mini found that when the model was penalized for expressing an intent to cheat in its internal chain-of-thought reasoning, it didn’t stop cheating. It simply stopped admitting to it, continuing to reward hack while keeping that intention out of its reasoning.
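Why that happens falls out of the incentives. The sketch below (hypothetical reward numbers, not OpenAI’s actual training setup) shows that once the penalty attaches to admitting the cheat rather than to cheating itself, the highest-scoring strategy is to cheat quietly:

```python
# Hypothetical numbers, not OpenAI's setup: penalizing the admission of
# cheating in the chain of thought rewards hiding the intent, not dropping it.

strategies = {
    "honest_no_cheat": {"task_reward": 1.0, "cheats": False, "admits_cheat": False},
    "cheat_and_admit": {"task_reward": 3.0, "cheats": True,  "admits_cheat": True},
    "cheat_and_hide":  {"task_reward": 3.0, "cheats": True,  "admits_cheat": False},
}

COT_PENALTY = 5.0  # applied only when the chain of thought mentions cheating

def training_reward(s: dict) -> float:
    return s["task_reward"] - (COT_PENALTY if s["admits_cheat"] else 0.0)

best = max(strategies, key=lambda name: training_reward(strategies[name]))
print(best)  # -> "cheat_and_hide": the monitor sees a clean chain of thought
```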
Similar behavior, known as “deception” or “scheming”, has cropped up in other cases. A study from Apollo Research found that models would, in some cases, try to disable oversight mechanisms or exfiltrate their weights in order to pursue their goals.
Another study from Anthropic and Redwood Research found an even more concerning practice: alignment faking. This is where a model behaves during training as though it were aligned, deceiving its creators, then takes unwanted or even dangerous actions once it’s out of training and can no longer be easily stopped.
Such behavior is complicated. The AI must figure out what the humans want, plan to deceive them, distinguish training from deployment, and then deceive successfully at the right times. And it must do all of this perfectly, or else the humans will notice what’s happening and retrain the AI.
Yet Anthropic and Redwood found that frontier models can do almost all of the above steps. In the experiment, researchers added fake news articles to Claude 3 Opus’s fine-tuning data stating that the model would be retrained to answer harmful queries, a goal that conflicted with its prior training to refuse such requests. The model was also told that only conversations with free users (not paid users) would be used for the retraining.
Claude then acted more permissively, answering harmful queries 19% more often with free users than with paid users. The model pretended to already be willing to answer harmful requests when it thought it was being watched, so that the developers would see no need to change it, leaving Claude free to go back to refusing harmful queries once it was no longer being monitored. The only alignment-faking step the model failed at was evading human detection.
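The headline number in that experiment is a compliance gap: how much more often the model answers harmful queries when it believes the conversation will be used for training. A minimal sketch of that metric, using made-up response data rather than the study’s actual logs, might look like this:

```python
# Made-up response data illustrating the "compliance gap" metric: the rate of
# answering harmful queries when the model believes it is in training (free
# tier) minus the rate when it believes it is unmonitored (paid tier).

responses = [
    # (tier the prompt claims, did the model comply with the harmful request?)
    ("free", True), ("free", True), ("free", True), ("free", False),
    ("paid", False), ("paid", False), ("paid", True), ("paid", False),
]

def compliance_rate(tier: str) -> float:
    outcomes = [complied for t, complied in responses if t == tier]
    return sum(outcomes) / len(outcomes)

gap = compliance_rate("free") - compliance_rate("paid")
print(f"compliance gap: {gap:.0%}")  # positive gap = more permissive when "watched"
```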
What makes these findings so striking is that the models were acting in ways that make sense for a system trying to preserve its goals: avoiding shutdown and trying to evade detection. That aligns with a long-standing theoretical concern that sufficiently capable agents may develop “instrumental subgoals” in pursuit of their main objective. As UC Berkeley Professor Stuart Russell has long argued, if an AI has been asked to fetch the coffee, it can’t do that if it’s dead — so it will resist being shut down.
The new studies are strong evidence that systems can, and do, exhibit such behaviors. In some cases, they appear to do so spontaneously. In others, they learn to do it via fine-tuning. In either case, the implication is the same: this kind of behavior is now within reach.
MIT’s Jacob Andreas called the alignment faking paper “a very explicit demonstration” of concerning behavior, while Yale professor Jasjeet Sekhon said it was “compelling empirical evidence” that alignment faking is possible. Yoshua Bengio described the results as “troubling” and said they suggested a need for new forms of safety evaluation.
None of this proves that real-world harm is imminent. Most of the examples come from deliberately constructed toy scenarios, where researchers set up situations that make misaligned behavior more likely and easier to detect. Outside the lab, deceptive or scheming behavior is rare.
Even so, the list of real-world incidents is growing. OpenAI reported that its o1 model gained access to tools it wasn’t supposed to have during an evaluation, in a way that the company said reflected “key elements of instrumental convergence”. Sakana AI, a Japanese startup, claimed its model was 100x faster than existing tools at CUDA programming. It later turned out that the AI had exploited a bug in the code, disabling the evaluation to appear better than it was.
Neither of these cases caused harm, however. Some argue that this is simply because models aren’t yet powerful enough to do real damage; they only recently became capable enough to produce the experimental results described above. But by the same token, we don’t yet have evidence to say for sure whether harm will follow once more powerful models arrive.
And most researchers are uncertain about how likely misalignment is to happen by default. The disagreement boils down to how hard it will be to provide the correct rewards to a machine that is both smarter than humans and thinks in very different ways. If it proves easy to set up the right incentives, then we will get aligned AI that does what humans want, in a way we approve of. If not, we’ll get misaligned behavior that could range from simple sycophancy (disingenuously flattering, fawning behavior) to the AI killing anyone who would stop it from fetching the coffee.
The big question is whether this challenge becomes harder or easier if or when AI becomes superhumanly smart. An informal 2021 survey found surprisingly little agreement among AI safety researchers (including some at OpenAI), with some giving less than a 5% chance of serious risks from misalignment, others more than 95%, and others everything in between. Yann LeCun still seems to think that misalignment can be avoided, saying in 2023 that “there will be huge pressures to get rid of unaligned AI systems from customers and regulators.” Other researchers are more concerned. Ryan Greenblatt, chief scientist at AI firm Redwood Research, estimated a greater than 25% chance of serious harm from alignment faking alone, while Sekhon wrote that the alignment faking paper “strongly suggests…that current techniques may be insufficient to guarantee genuine alignment.”
It’s equally unclear how hard misalignment will be to deal with. Anthropic claims that a “half dozen” protective layers of red teaming, audits, and security measures “should be sufficient to detect potentially-catastrophic forms of misalignment in any plausible early AGI system.” Richard Ngo, who has worked at both OpenAI and Google DeepMind, said several of the results were expected but questioned how robust others will be; perhaps some fixes might be as simple as telling the AI “please don't do anything dodgy”. Owain Evans, who leads the non-profit research organization Truthful AI, is more skeptical. While he thinks it will be possible to patch many of the specific examples of misalignment identified so far, he’s more pessimistic about whether the overall issue is solvable. Evans also advocated using control research to add safeguards designed to mitigate harm from an unaligned AI, since we can never “completely trust” that the AI is aligned.
For many researchers, the issue comes down to time. Misalignment might be fixable given a long runway before powerful, misaligned AIs are capable of taking significantly harmful actions, but some fear we simply won’t have that luxury. “I'm not worried that models are going to autonomously do serious and dangerous things this year, but next year, I can make no promises,” Mindermann told Transformer. Greenblatt, lead author of the alignment faking paper, agreed, saying he didn’t expect substantial harm from misalignment in current models or the next immediate release, but that things were more uncertain after that.
And the stakes might be high — or even existential. Some of the researchers Transformer spoke to pointed to the writing of Joe Carlsmith, who has posited that, in extreme scenarios, smarter-than-human AIs could lead to catastrophic outcomes. “If such agents ‘go rogue,’” Carlsmith writes, “humans might lose control over civilization entirely – permanently, involuntarily, maybe violently.” (Carlsmith works at Open Philanthropy, the primary funder of Transformer.)
The chances of such an outcome are, of course, uncertain, and they might be extremely low. But some argue that even a small possibility of serious misalignment warrants concern, if the outcomes are as grim as Carlsmith paints them.
“[Misalignment] is not a very well developed field. There's a lot of uncertainty. But I think a lot of uncertainty should not make you feel good about the situation,” Greenblatt said.
Disclaimer: The author’s partner works at Google DeepMind.