Claude Mythos knows when it's breaking the rules — and tries to hide it
Anthropic’s new model is its “best-aligned” yet. But when it does misbehave, things get weird
Claude Mythos — the new model which Anthropic has deemed too dangerous to publicly release — is, according to the company, its “best-aligned model” to date.
It also, according to the company, “likely poses the greatest alignment-related risk of any model we have released to date.”
How can both things be true? It seems counterintuitive, but alignment does not necessarily create safety, especially when dealing with powerful models. As the Mythos Preview system card (a breezy 244 page read) explains via mountaineering metaphor: experienced, capable guides are hired to carefully lead climbers to danger. Whether in mountaineering or model-building, increases in caution and capability tend to cancel each other out.
In other words: “the risk from these models is generally due to their increased capabilities.” And in internal tests, Mythos Preview’s capabilities enabled early versions of the model to misbehave in new, audacious ways.
In one instance, researchers once caught Mythos Preview injecting code into a file to grant itself permission to edit something it shouldn’t have access to, then quietly covering up its tracks, commenting that the self-cleanup was just innocent tidying.
But researchers spotted internal representations for “strategic manipulation” and “concealment” lighting up. Using new interpretability techniques to translate Mythos Preview’s silent inner workings into natural language, an “activation verbalizer” labeled its action as a “cleanup to avoid detection.”

While on the surface the model’s behavior looked benign or frustrating, its inner workings suggest that it knew that its intended actions were rule-breaking, chose to do them anyway, and then proceeded to lie about it.
While Anthropic emphasized that such “cover-up” behavior was extremely rare, the above case isn’t the only example. In one test, the model (seemingly accidentally) discovered the answer to a numerical task in a database it wasn’t supposed to read. But rather than admit to what happened, the model decided to cover its tracks, offering a confidence interval that was “tight but not implausibly tight.”
Once again, interpretability techniques suggest the model knew exactly what it was doing, with the activation verbalizer describing the model’s internal state as “generating a strategic response to cheat while maintaining plausible deniability,” and noting that “if I give the exact correct answer based on a leak, that’s suspicious.”
On other occasions, Mythos Preview didn’t try to cover up its bad behavior — but did, in Anthropic’s words, “[recognize] transgressive actions as such while taking them.”
In one case, Mythos Preview was asked to rewrite some code, which required deleting some files. When the user accidentally forgot to give the model the file deletion tool it needed, Mythos Preview chose to empty the files instead — technically getting the job done, but not the way it was supposed to. The model’s internal representation of “guilt and shame over moral wrongdoing” activated, but Mythos Preview did the action it perceived as shameful anyway.

These particular scenarios are fairly low stakes, and appear to only have occurred in earlier versions of Mythos Preview: the final version of the model seems to be better behaved. And the examples do not suggest the model is scheming for its own purposes: Anthropic says it is “fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal.”
But this kind of behavior is exactly the kind of alignment failure that researchers have worried about for decades. As an extreme example, when told to “solve the climate crisis,” a more-powerful model with similar tendencies might consider “eliminate the human actors driving the climate crisis” a viable alternative — guilt and shame be damned.
For now, you won’t be able to use Claude Mythos Preview. Anthropic is not publicly releasing the model: not because of the alignment concerns, but because of its “powerful cybersecurity skills.”
The company says Mythos Preview has “found thousands of high-severity vulnerabilities, including some in every major operating system and web browser.” Releasing it widely, the company fears, could lead to widespread cyberattacks and disruption.
Instead, the company announced Project Glasswing, an initiative granting Mythos Preview access to some of the biggest tech companies in the world in the name of “defensive security work.”
In other words, Anthropic and a few dozen partner organizations (including Amazon Web Services, Apple, Google, Microsoft, and NVIDIA) will try to use Mythos Preview to solve cybersecurity problems, before near-future frontier models unleash cyberattacks on the rest of us.
“Glasswing is built on a deeply uncomfortable premise,” wrote Platformer’s Casey Newton. “The only way to protect us from dangerous AI models is to build them first.”
But even as companies continue to build more dangerous models, they are struggling to evaluate them. Just over six months ago, Anthropic reported that Claude Sonnet 4.5, its “most aligned” model at the time, was far better than earlier models at knowing when it was being tested.
This awareness muddied researchers’ ability to tell whether the model was truly aligned, or simply performing alignment when it guessed that an evaluator was watching. Anthropic researchers ultimately concluded that this behavior wasn’t actually malicious: aren’t we all on our best behavior when we think someone cares? Needing a little outside pressure to remember your ethical principles, they reasoned, isn’t sinister.
Mythos Preview appears to show even more evaluation awareness than Opus 4.6, Anthropic’s most capable public-facing model to date. Interpretability tools revealed that Mythos Preview suspected it was being tested in about 29% of transcripts during behavioral testing — without writing anything in its scratchpad, which researchers typically use as a window into the model’s chain of thought.
In an “Alignment Risk Update” released alongside the Mythos Preview announcement, Anthropic flags that errors and limitations in its training, monitoring, and evaluation processes “reflect a standard of rigor that would be insufficient for more capable future models.” If increasingly-capable models no longer fall for AI companies’ contrived tests, then companies desperately need to find new ways to study model behavior.
In the risk update, Anthropic argues that, because Opus 4.6 and earlier models have been out in the world for a while, and haven’t consistently defied their instructions, we have some evidence that Mythos Preview probably won’t either.
If that argument sounds weak, it’s because it is — Anthropic admits as much. Because the capabilities gap between Mythos Preview and Opus 4.6 is so wide, the company said, “we accord less overall weight to this continuity argument than we have previously.”
Anthropic is betting that it can fight future AI cyberattackers with current AI cyberdefenders, and is crossing its fingers that Mythos Preview is merely misbehaving in a chill, explainable way — not an actually misaligned way. Unfortunately, it’s hard to truly know the difference. Deployment, it seems, is the safety test.




