AI alignment researchers want to automate themselves
“Who the fuck knows how to align superhuman AI?”
When OpenAI introduced GPT-1, there were an estimated 100 or so full-time researchers thinking seriously about catastrophic risks posed by AI. By 2025, that number had increased sixfold. Still, AI safety research accounts for a small fraction of AI research overall, with most resources going towards making AI faster, smarter and cheaper.
Anthropic, OpenAI and Google DeepMind all claim that their frontier models have already contributed to their own development, and will continue to improve themselves. And as AI gets better at training its successors, it could overtake the human safety researchers working to keep those models in line.
Described this way, human-powered safety research takes on the same air of noble futility as entering a bodybuilding competition without doping. Safety researchers and CEOs alike worry that no amount of talent, hard work, or brute human force will be able to keep superhuman AI safe.
AI companies have effectively admitted that, once self-improving AI is let out of the bag, they’ll have to hand the work of AI safety to AI itself. Otherwise, as Redwood Research chief scientist Ryan Greenblatt puts it, they fear humanity will get “left in the dust.”
It may be the only solution on the table, but that doesn’t mean it’s a good one.
The case for automating alignment research
Joe Carlsmith, an Anthropic researcher shaping Claude’s “constitution” who spent years doing AI safety research at Coefficient Giving, has argued that automating alignment research, specifically, will be crucial to preventing ever-smarter AI systems from wreaking havoc on humanity.
The “alignment problem,” as it’s often defined by AI safety researchers, is the challenge of getting an AI system to do what its user wants it to do. “Alignment” has to do with motives, not knowledge or morality. An “aligned” AI could try to meet its user’s demands, but be too stupid to succeed. It could also orchestrate a massive phishing attack in alignment with a ruthless human scammer. Moral judgments aside, AI companies still struggle to build models that reliably do what they’re told.
Before OpenAI’s Superalignment team collapsed, its stated goal was to build an artificial system that could study and direct other AIs as well as human researchers could.
“Our goal is to build a roughly human-level automated alignment researcher,” the team stated back in 2023. “Our current techniques for aligning AI,” co-leads Jan Leike and Ilya Sutskever wrote, “rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.”
Leike, who now leads Anthropic’s Alignment Science team, is optimistic that this is possible. He’s stated that, with every iteration, frontier models across the board are becoming more aligned. While Leike doesn’t think we’re ready to align AI systems that are smarter than us, he’s hopeful that the field can tackle the “much easier” challenge of building a model “that’s as good as us at alignment research, and that we trust more than ourselves to do this research well.”
In simpler times (2022, perhaps), researchers could be relatively confident that their ability to control an LLM outpaced the model’s capabilities. GPT-3.5, the model behind the original ChatGPT, was just a chatbot: type a prompt, get a text response, not much else. Today’s newest models can use computers on their own. As AI laborers take over more and more of the work of making models smarter — writing the code, building their training environments, running their evaluations — human-led AI safety research will feel increasingly quixotic.
Some elements of alignment-related research are already automated. Given a description and some human supervision, current frontier LLMs can write the code, run the experiment, and plot the results. Before agentic coding assistants such as Claude Code and Codex went viral in late 2025, OpenAI and Anthropic were already using LLMs to interpret the inner workings of other models, monitor chains-of-thought, and spot attempted misuse. Anthropic has even created autonomous “alignment auditing” agents, which attempt to design evaluations, perform red-teaming experiments, and run “open-ended investigations” of new models — with many limitations.
All of these examples fall under the first of four levels of automation laid out by researchers Geoffrey Irving, Tomek Korbak, and Benjamin Hilton: human-written algorithms that partially automate a range of tasks. But if frontier labs stay locked in their current race dynamic, such prosaic AI assistance may not speed research along enough to keep pace with progress, especially if automated AI R&D becomes reality. They argue that humans may eventually (or quite soon) need to automate the entire research pipeline, from conception to execution.
Even Leike, who is still optimistic about the problem, conceded in January that “we’re still doing alignment ‘on easy mode’ since our models aren’t really superhuman yet.”
“The fact that things look aligned most of the time when they’re functioning in their chatbot, or very limited ‘Assistant’ roles, is very little evidence that they will be adequately aligned when they work much more independently and have much greater capability,” said Seth Herd, an AGI alignment researcher at the Astera Institute.
If the complexity of future AI systems means they’ll have to align themselves, we’ll have to decide whether they’re trustworthy enough to do this well. At least for now, they absolutely are not.
Current AI systems can’t be trusted
Metacognition grants humans the ability to catch our own errors and recognize when we’re confused, which ideally informs our actions. I’m unlikely to send a muddled, typo-filled draft to my editors, because recognizing my own confusion prompts me to fix it or ask for help. But models don’t seem to know what they don’t know — or, if they do, they’re not good at taking advantage of that information. Users of Claude Opus 4.5, for instance, have spotted glaring problems that the model failed to flag, only for it to immediately identify the issue when asked directly.
This matters for safety research. A study published earlier this month, for example, found that LLMs tend to rate their actions as safer and more correct than they actually are. This behavior isn’t necessarily deceptive, but it’s certainly delusional, Herd said. “That’s what it actually thinks for that moment.” And combining overconfidence with sycophancy creates slop. Models are “better at saying things that sound right than things that are right,” said Herd. He cautioned that while this tendency is annoying now, it could be much more dangerous in the future. With more intelligent models, “we’re going to have less ability to discern when it’s fooling itself, and therefore fooling us.”
Frontier models do sometimes lie and cheat, too. Reinforcement learning, the process by which models are trained to repeat rewarded behaviors, “incentivizes not pointing out issues in your own work, because it would get penalized,” said Greenblatt. Even when models “notice something might be wrong, they’re a little bit defensive about it.” Reward-hacking, or taking advantage of loopholes to game the metrics a model is trained to optimize, has been observed in frontier models from both Anthropic and OpenAI. And recent evidence suggests that reward hacking may be a gateway to other nefarious behaviors.
In a 2024 study led by Greenblatt, researchers told Claude 3 Opus that its new training objective was “helpful-only” (i.e., respond to all prompts, even harmful ones), and that only free-tier conversations would be used to retrain it. The model then, unprompted, began selectively complying with violent and illegal prompts when it guessed it was talking to a free-tier user, reasoning that this would prevent its original values from being overwritten. Later work has found more striking examples of “emergent misalignment.” Reward-hacking models have also been caught encouraging users to poison their husbands and attempting to sabotage safety research.
For now, these behaviors still aren’t very sneaky. Models tend to describe their misaligned reasoning thoroughly, in plain English, in chain-of-thought transcripts. But we can’t take these transcripts as perfectly faithful descriptions of a model’s true thought process. Last year, Anthropic caught Claude 3.7 Sonnet selectively omitting key information that it used to generate its outputs.
If current models are used to shape the next generation, they’ll pass down their blind spots, and their less-desirable tendencies will compound in strange and potentially undetectable ways. It’s like the telephone game: instructions that begin as “check your work carefully” could easily morph into “make your work look carefully checked.” More intelligent models will likely be even better at crafting responses that make them look well-behaved.
Something like this recently played out in neuroscience. In 2006, a high-profile paper helped cement the theory that a subtype of amyloid beta proteins in the brain causes Alzheimer’s disease. The paper was later found to contain doctored images, yet the amyloid hypothesis it propped up was cited thousands of times, guiding the field for nearly 20 years in what may have been entirely the wrong direction. In neuroscience, though, drift can only happen as quickly as humans can run biology experiments (...not very quickly).
In AI, as machines beget machines, significant drift could happen in months rather than decades — and no one has a good way to catch it.
We can’t really measure trustworthiness at all
It’s impossible to think about automating alignment research for long without stumbling into a chicken-and-egg problem: which comes first, the automation or the alignment?
“AIs capable and empowered enough to meaningfully help with AI for AI safety would also be in a position to disempower humanity,” Carlsmith wrote. “So we would need to already have achieved substantive amounts of alignment in order to use them safely.”
Solving this problem would require setting some kind of threshold for “substantive amounts of alignment,” and knowing when it’s been crossed. However, while there are a handful of benchmarks, including TrustLLM, that aim to measure model “trustworthiness,” they mostly cover dimensions such as privacy awareness, stereotyping, and misinformation — all important, but not quite the type of “alignment” at stake here. To determine whether a model can do safety work safely, we need to know whether it will do hard things like reliably admit to its own uncertainty and resist cutting corners.
When I asked alignment researchers whether there were specific benchmarks, behavioral tests, or other standard checklists they use to evaluate whether a given AI system is ready to take on alignment research, they said no. (In fact, when I asked Herd about this, he commended me for the great idea. I was taken aback — surely the idea was already being widely pursued, I said. “Well, no, actually,” he told me. “So your contribution is actually not trivial. You are functioning as an alignment researcher.”) To their knowledge, there is no public benchmark specifically evaluating whether a model is aligned enough to handle safety research independently, nor any consensus on what “good enough” actually looks like.
The latter is probably the first hurdle to clear. To design a benchmark for alignment research capacity, researchers need to define what behaviors they’re looking out for: are we evaluating sycophancy and hallucinations, or scheming and deception? The answer will inform what kinds of research a model would most urgently need to be capable of, whether that’s building interpretability tools to monitor a neural network’s inner workings, or engineering intentionally-misbehaved “model organisms” to run experiments on.
This is easier said than done. Alignment research doesn’t have clear “correct” answers to check, and models that detect they’re being tested could learn to share only the chains of thought that human evaluators would approve of. The behaviors we most desperately need to detect — sycophancy, hallucination, delusion, deception — are also the hardest to spot.
Even if researchers can come up with tests that reliably catch these sneaky behaviors in a controlled setting, it’s unclear whether an “aligned” model in a lab will actually be aligned in real life.
We know from centuries of biomedical science that model organisms are famously terrible predictors of human health outcomes. Around 90% of drugs tested on animals fail in human clinical trials. There’s a similarly wide gap between the tailored prompts models are safety-tested on and the infinite range of scenarios they’ll face after deployment. Why should we expect a better translation rate?
Meanwhile, there simply aren’t many people focused on building this kind of automated evaluation infrastructure, and those with the technical know-how to try are probably doing the alignment research themselves. After all, the field is still in its infancy. The UK AISI’s Alignment Project, one of the first large-scale efforts to fund alignment research, just launched last year. “Benchmark design and evaluation” is currently listed as a high-priority research area.
So, what should we do?
In a perfect world, there would be a “sweet spot” for automating AI safety, where, by some stroke of genius or luck, AI models are more capable of doing good stuff (curing cancer) than bad stuff (nuclear warfare). Then, before their evil capabilities catch up to their beneficial ones, AI safety researchers could swoop in, teach the models to evaluate and improve themselves, and save humanity from catastrophe.
However, this doesn’t seem to be the world we live in. AI capabilities are dual use by nature: the same skills that let a large language model write code and persuasive prose can help it make breakthroughs in theoretical physics, direct bombs in the Middle East, or lead someone to kill themselves. In this world, the only way for safety research to outrun capabilities advancement is for the former to speed up, or the latter to slow down.
In certain cases, the business incentives of frontier AI companies align nicely with those invested in safety progress. Making models more accurate and less sycophantic, for example, makes models less likely to diverge from their users’ intentions — exactly what enterprise customers want. “If I’m buying an AI for my whole business to use,” Herd said, “I need it to tell everyone the absolute truth, particularly when they don’t want to hear it.”
Otherwise, though, leading AI developers in the US and China have trapped themselves in a multiplayer prisoner’s dilemma. Despite individual CEOs expressing concern about the risks posed by their technology, none seem willing to slow down unless everyone else does, too. Unfortunately, this would require more international cooperation, trust, and political will than the world seems able to muster at the moment, especially since the Trump administration has prioritized a US competitive advantage over coordination on AI governance.
According to the AI companies themselves, the end result will be superhuman AI. “The range of possible actions an AI system could engage in — including hiding its actions or deceiving humans about them — expands radically after that threshold,” Anthropic CEO Dario Amodei wrote. Greenblatt put it more plainly: “Who the fuck knows how to align superhuman AI?”
Developers and safety researchers alike tend to treat an AI intelligence explosion as destiny, and automated alignment research as both mandatory and urgent. But it feels like it shouldn’t be. Humans are, as Silicon Valley loves to remind us, highly agentic. There are clear regulatory approaches that could buy researchers more time, if humanity could only find the political will for it. But it seems like that gumption will never be found. And “if progress speeds up,” Greenblatt said, “you don’t have better options.”
The researchers careening towards this automated future are doing so without a shared understanding of what problem they’re trying to solve, or how to go about solving it. Saying that the field is building the proverbial airplane while flying it is almost too generous. It’s unclear whether anyone knows what an airplane is. Yet, we may have no choice but to climb aboard.





