No, alignment isn’t solved
Progress on ensuring models are in step with humans has calmed nerves. But some of the biggest problems are far from solved, and many more lie just over the horizon

Researcher Adrià Garriga-Alonso says he quit his AI safety job in December because there was “no point” doing more speculative work to ensure AI systems stay within human control: he believes current strategies will be enough.
In January, David Dalrymple, programme director at the UK’s Advanced Research and Invention Agency, dropped his probability estimate for AI-caused human extinction from 40–50% to 5–8%, even assuming no further progress is made on alignment.
These sorts of reassuring moves are remarkable. For most of the 2010s, alignment looked to many working with AI like a problem we might need to solve on the first try or face extinction. Now some researchers think we’re most of the way there.
Take the value alignment problem. In 2019, AI luminaries including Yoshua Bengio, Stuart Russell, and Yann LeCun debated whether we would ever be able to get AI to understand human values. Under the dominant reinforcement learning paradigm, AI learned from trial and error in simulated environments, a process alien to human cognition. AlphaGo Zero, for example, simulated millions of games against itself to master the game. As Google DeepMind’s Seb Krier has noted, it wasn’t clear how we’d ever teach such a system human values.
But instead, pretraining on vast quantities of human text became the dominant paradigm for LLMs. Such models absorb values, in some fashion, from what humans write down, in theory reducing the need to instill those values through trial and error.
“We do this pretraining on human data, and then we get something that… understands human values fairly innately now,” says Garriga-Alonso, who previously worked at FAR.AI and Redwood Research. In his own writings, Anthropic CEO Dario Amodei agrees, noting that we’ve learned that “models inherit a vast range of humanlike motivations…from pretraining.” That understanding underpins current alignment efforts like Anthropic’s constitution for Claude — the model must understand notions like “helpful” or “harmless” in order for written principles to guide it.
Progress in developing more powerful models has also been more incremental than some feared — multiple frontier models, successive versions, continuous experimentation — which means researchers can iterate rather than needing to get alignment right on the first try. “We can evolve our mitigations and safeguards incrementally with our models,” argues Jan Leike, an AI alignment researcher now leading the Alignment Science team at Anthropic.
While the likes of Eliezer Yudkowsky still fear there will not be enough time to iterate before AGI arrives, the belief that alignment must be got right on the first try is waning. “I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes,” writes Garriga-Alonso. “This is possible because decisive strategic advantages from a new model won’t happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.”
Researchers are also building infrastructure for iteration. For example, “model organisms” research, named as a nod to the fruit fly and other easy-to-study species that appear endlessly in biology research, creates toy environments where misalignment can be easily studied in real models: can researchers get the AI to demonstrate misalignment at all? Is it hard to do, or is it so easy that we’ll likely see lots of misalignment in the wild? How well do alignment strategies work in these settings where success and failure are measurable? Many early demonstrations of misalignment come from such research. “One of the things that’s so powerful about model organisms is that they give us a testing ground for iteration,” writes Evan Hubinger, the alignment stress-testing team lead at Anthropic.
In parallel, researchers have developed approaches which use AI to help keep AI in check. These include scalable oversight and control proposals which aim to use aligned but weaker models to monitor stronger ones, extending human ability to oversee systems that are smarter than us.
Ryan Greenblatt, chief scientist at Redwood Research, tells Transformer that baseline scalable oversight methods have worked better than he expected, though he also says less effort has gone into developing such strategies than he’d hoped. Once further developed, these approaches could help ensure that smarter-than-human AIs don’t run amok simply because we can’t tell what they’re up to. Dalrymple is optimistic about this approach, particularly because he thinks current models are more aligned than he anticipated.
And yet, even the low end of Dalrymple’s 5–8% range is far from a small number given the stakes. A roughly one-in-20 chance of human extinction could still make AI the biggest threat humanity faces — worse than some estimates of nuclear war, climate change, and engineered pandemics combined.
And while parts of alignment are proving easier than feared, the hardest problem remains untouched.
“We’re still doing alignment ‘on easy mode’ since our models aren’t really superhuman yet,” says Leike. Hubinger agrees: the crucial problem will be overseeing systems that are smarter than humans, and we haven’t yet seen how our techniques fare against it. Greenblatt concurs: “Once the models are qualitatively very superhuman, lots of stuff starts breaking down.”
Meanwhile, warning signs of other misalignment problems are emerging. Research designed to elicit misaligned behavior has turned up blackmail, deception, and cheating. Anthropic’s Amodei writes that problems “seem particularly likely to occur when AI systems pass a threshold from less powerful than humans to more powerful than humans, since the range of possible actions an AI system could engage in — including hiding its actions or deceiving humans about them — expands radically after that threshold.” He has put the chance of things going “really, really badly” at 25%.
While some seem increasingly relaxed about the tractability of alignment, others caution that signs it is solvable don’t mean it will be solved. In December, renowned AI professor Stuart Russell said he thought it was possible to make safe, aligned AI — but that we won’t get it without regulation, since companies “need to make the AI systems millions of times safer” to bring the risk down to levels deemed acceptable from other sources, such as nuclear reactors or asteroid strikes. Greenblatt is also in this camp; he has written that existential risk from misalignment could be reduced to 7% given political will for international coordination and significant investment in safety work. “It seems to me like risk is very elastic to how much people try,” he writes. “If the world was trying very hard, risk would probably be lower.”
As Leike puts it: “Just because a problem is solvable, this doesn’t mean it’s solved. We have to actually keep doing the work to get it done.”
