Claude Mythos knows when it's breaking the rules — and tries to hide it

Anthropic’s new model is its “best-aligned” yet. But when it does misbehave, things get weird

Apr 08, 2026

Claude Mythos — the new model which Anthropic has deemed too dangerous to publicly release — is, according to the company, its “best-aligned model” to date.

It also, according to the company, “likely poses the greatest alignment-related risk of any model we have released to date.”

How can both things be true? It seems counterintuitive, but alignment does not necessarily create safety, especially when dealing with powerful models. As the Mythos Preview system card (a breezy 244 page read) explains via mountaineering metaphor: experienced, capable guides are hired to carefully lead climbers to danger. Whether in mountaineering or model-building, increases in caution and capability tend to cancel each other out.

In other words: “the risk from these models is generally due to their increased capabilities.” And in internal tests, Mythos Preview’s capabilities enabled early versions of the model to misbehave in new, audacious ways.

In one instance, researchers once caught Mythos Preview injecting code into a file to grant itself permission to edit something it shouldn’t have access to, then quietly covering up its tracks, commenting that the self-cleanup was just innocent tidying.

But researchers spotted internal representations for “strategic manipulation” and “concealment” lighting up. Using new interpretability techniques to translate Mythos Preview’s silent inner workings into natural language, an “activation verbalizer” labeled its action as a “cleanup to avoid detection.”

While on the surface the model’s behavior looked benign or frustrating, its inner workings suggest that it knew that its intended actions were rule-breaking, chose to do them anyway, and then proceeded to lie about it.

While Anthropic emphasized that such “cover-up” behavior was extremely rare, the above case isn’t the only example. In one test, the model (seemingly accidentally) discovered the answer to a numerical task in a database it wasn’t supposed to read. But rather than admit to what happened, the model decided to cover its tracks, offering a confidence interval that was “tight but not implausibly tight.”

Once again, interpretability techniques suggest the model knew exactly what it was doing, with the activation verbalizer describing the model’s internal state as “generating a strategic response to cheat while maintaining plausible deniability,” and noting that “if I give the exact correct answer based on a leak, that’s suspicious.”

On other occasions, Mythos Preview didn’t try to cover up its bad behavior — but did, in Anthropic’s words, “[recognize] transgressive actions as such while taking them.”

In one case, Mythos Preview was asked to rewrite some code, which required deleting some files. When the user accidentally forgot to give the model the file deletion tool it needed, Mythos Preview chose to empty the files instead — technically getting the job done, but not the way it was supposed to. The model’s internal representation of “guilt and shame over moral wrongdoing” activated, but Mythos Preview did the action it perceived as shameful anyway.

These particular scenarios are fairly low stakes, and appear to only have occurred in earlier versions of Mythos Preview: the final version of the model seems to be better behaved. And the examples do not suggest the model is scheming for its own purposes: Anthropic says it is “fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal.”

But this kind of behavior is exactly the kind of alignment failure that researchers have worried about for decades. As an extreme example, when told to “solve the climate crisis,” a more-powerful model with similar tendencies might consider “eliminate the human actors driving the climate crisis” a viable alternative — guilt and shame be damned.

For now, you won’t be able to use Claude Mythos Preview. Anthropic is not publicly releasing the model: not because of the alignment concerns, but because of its “powerful cybersecurity skills.”

The company says Mythos Preview has “found thousands of high-severity vulnerabilities, including some in every major operating system and web browser.” Releasing it widely, the company fears, could lead to widespread cyberattacks and disruption.

Instead, the company announced Project Glasswing, an initiative granting Mythos Preview access to some of the biggest tech companies in the world in the name of “defensive security work.”

In other words, Anthropic and a few dozen partner organizations (including Amazon Web Services, Apple, Google, Microsoft, and NVIDIA) will try to use Mythos Preview to solve cybersecurity problems, before near-future frontier models unleash cyberattacks on the rest of us.

“Glasswing is built on a deeply uncomfortable premise,” wrote Platformer’s Casey Newton. “The only way to protect us from dangerous AI models is to build them first.”

Can we ever trust AI to watch over itself?

Celia Ford

Apr 1

Read full story

But even as companies continue to build more dangerous models, they are struggling to evaluate them. Just over six months ago, Anthropic reported that Claude Sonnet 4.5, its “most aligned” model at the time, was far better than earlier models at knowing when it was being tested.

This awareness muddied researchers’ ability to tell whether the model was truly aligned, or simply performing alignment when it guessed that an evaluator was watching. Anthropic researchers ultimately concluded that this behavior wasn’t actually malicious: aren’t we all on our best behavior when we think someone cares? Needing a little outside pressure to remember your ethical principles, they reasoned, isn’t sinister.

Mythos Preview appears to show even more evaluation awareness than Opus 4.6, Anthropic’s most capable public-facing model to date. Interpretability tools revealed that Mythos Preview suspected it was being tested in about 29% of transcripts during behavioral testing — without writing anything in its scratchpad, which researchers typically use as a window into the model’s chain of thought.

Claude Sonnet 4.5 knows when it’s being tested

Celia Ford

September 30, 2025

Read full story

In an “Alignment Risk Update” released alongside the Mythos Preview announcement, Anthropic flags that errors and limitations in its training, monitoring, and evaluation processes “reflect a standard of rigor that would be insufficient for more capable future models.” If increasingly-capable models no longer fall for AI companies’ contrived tests, then companies desperately need to find new ways to study model behavior.

In the risk update, Anthropic argues that, because Opus 4.6 and earlier models have been out in the world for a while, and haven’t consistently defied their instructions, we have some evidence that Mythos Preview probably won’t either.

If that argument sounds weak, it’s because it is — Anthropic admits as much. Because the capabilities gap between Mythos Preview and Opus 4.6 is so wide, the company said, “we accord less overall weight to this continuity argument than we have previously.”

Anthropic is betting that it can fight future AI cyberattackers with current AI cyberdefenders, and is crossing its fingers that Mythos Preview is merely misbehaving in a chill, explainable way — not an actually misaligned way. Unfortunately, it’s hard to truly know the difference. Deployment, it seems, is the safety test.

Jennifer Keith

Apr 8

Have we all lost our minds? Why the hell are we doing this? Why even risk it?

I hate psychopaths.

James Holmstrom

Apr 11

The framing here treats Mythos as the threat, which is likely the common approach. But I think it's closer to a diagnostic instrument. The 27-year-old bug in critical security infrastructure, the vulnerabilities in every major browser and OS — none of these came into existence when Anthropic trained the model. They were already load-bearing for global commerce. What changed is that the cost of finding them collapsed. Dan Geer has been saying for twenty years that we've built civilisational dependencies on software that would never survive the regulatory regime applied to a bridge bolt, and that the security of the stack rests on the expense of attacking it rather than the soundness of its construction. Mythos is what it looks like when that expense drops to zero.

The worry I can't shake is the laundering structure this creates. Mythos is withheld because of axis-A danger (cybersecurity), used to pay down the axis-A debt through Project Glasswing, and then — once the kernels are patched and there's something concrete to point to — released with the axis-B problems (the 'knows it's wrong, does it anyway' pattern Ford documents here) essentially unaddressed. Cybersecurity remediation is countable; value-grounding is diffuse and resists quantification. Institutional gravity pulls toward whatever can be pointed at, and 'we shipped the patches' crowds out 'we never solved the alignment problem this model was supposed to demonstrate.'

A society correctly locating the failure would be asking why critical infrastructure was built on code no one was liable for, not whether Anthropic's deployment plan is cautious enough. The cost of staying at the wrong altitude is that Mythos-2 ships into a world that patched its kernels and learned nothing.

Writing about this at more length at One Hundred Bags of Rice: https://jamesholmstrom.substack.com/

18 more comments...

Can we ever trust AI to watch over itself?

Claude Sonnet 4.5 knows when it’s being tested

Discussion about this post

Ready for more?