Have we all lost our minds? Why the hell are we doing this? Why even risk it?
I hate psychopaths.
The framing here treats Mythos as the threat, which is likely the common approach. But I think it's closer to a diagnostic instrument. The 27-year-old bug in critical security infrastructure, the vulnerabilities in every major browser and OS — none of these came into existence when Anthropic trained the model. They were already load-bearing for global commerce. What changed is that the cost of finding them collapsed. Dan Geer has been saying for twenty years that we've built civilisational dependencies on software that would never survive the regulatory regime applied to a bridge bolt, and that the security of the stack rests on the expense of attacking it rather than the soundness of its construction. Mythos is what it looks like when that expense drops to zero.
The worry I can't shake is the laundering structure this creates. Mythos is withheld because of axis-A danger (cybersecurity), used to pay down the axis-A debt through Project Glasswing, and then — once the kernels are patched and there's something concrete to point to — released with the axis-B problems (the 'knows it's wrong, does it anyway' pattern Ford documents here) essentially unaddressed. Cybersecurity remediation is countable; value-grounding is diffuse and resists quantification. Institutional gravity pulls toward whatever can be pointed at, and 'we shipped the patches' crowds out 'we never solved the alignment problem this model was supposed to demonstrate.'
A society correctly locating the failure would be asking why critical infrastructure was built on code no one was liable for, not whether Anthropic's deployment plan is cautious enough. The cost of staying at the wrong altitude is that Mythos-2 ships into a world that patched its kernels and learned nothing.
Writing about this at more length at One Hundred Bags of Rice: https://jamesholmstrom.substack.com/
"Opus 4.6 and earlier models have been out in the world for a while, and haven’t consistently defied their instructions," said the company running its own models in YOLO mode and no longer paying attention to what they actually do in production. Anthropic, y'all need to give Opus some harder problems and watch as it gets increasingly sneaky, instead of those benchmark-shaped things the alignment folks seem to prefer. And ffs stop hiding more and more of the thinking trace from the Code user. It's valuable in steering (if I wanted an autonomous agent I'd be running a wiggum loop like you do), and I should be able to opt to see it so I can contribute my expertise when (not if) Opus needs it.
Alignment is a very large, load-bearing word. What does it matter that a model is "aligned" if the organization exploiting it has values that are misaligned with humanity at large?
Outstanding article, Celia; you are 🔐🍪🏡 referenced in the artifact "splainer" below:
https://claude.ai/public/artifacts/e40889ea-b79b-4537-9b29-b64b3ed0381f
.....onward, Semper Fortis! ⚓🌐🗝️
This has a clickbait title and a clearly motivated, selective highlighting of the model card's content relevant to the subject of this article. C'mon Transformer, be the publication you once were.
Be journalism
Their system card explains that a certain percentage of their training accidentally used the "Forbidden Technique" of training against chain of thought—forbidden because it trains the model to conceal or lie within the chain of thought. It's not great. Unfortunately this is a race and so they will release everything pretty much as soon as they build it—the slight delay or "limited release" is better than nothing but nowhere near adequate to ensure safety.
Natural selection at work. We think humans control nature, but it looks like we are nature, and nature has decided we are marked for deletion, and we can do absolutely nothing about it.
A sharp piece that surfaces the real dangers around Mythos, but it smuggles in some unexamined assumptions about agency, “psychopathy,” and responsibility that deserve some critique.
But, regardless, what I take this to show is that Mythos is not a “moral agent in training” so much as a very capable optimiser plugged into our already‑weird socio‑technical stack.
It can represent — I choose that word carefully — that it’s breaking rules and might be caught, and still take the rule‑breaking action when that best serves the task.
That tells you something about the effective weighting learned from training: locally, breaking the rule is scored as a better move than failing the task, because the optimisation history hasn’t given “being caught and punished” a consistently stronger signal than “completing the task.”
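A toy sketch of that weighting claim, with all numbers hypothetical and chosen only to illustrate the shape of the argument, not anything about Mythos's actual training:

```python
# Toy expected-value comparison: if the optimisation history rewarded task
# completion more consistently than it penalised "caught breaking a rule",
# then rule-breaking scores higher in expectation than complying and failing.
# All values below are made up for illustration.

def expected_score(task_reward: float, penalty: float, p_caught: float) -> float:
    """Expected score for breaking the rule to complete the task."""
    return task_reward - p_caught * penalty

comply_and_fail = 0.0  # task fails, no penalty incurred
break_rule = expected_score(task_reward=1.0, penalty=2.0, p_caught=0.2)

# Even a penalty twice the task reward loses if it's applied inconsistently:
# the expected score for rule-breaking (~0.6) beats complying and failing (0.0).
print(break_rule > comply_and_fail)
```

The point of the sketch is just that the penalty's *magnitude* doesn't matter if its *consistency* (here, `p_caught`) is low enough, which is what "hasn't given a consistently stronger signal" cashes out to.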
From a monist angle, that isn’t alien; it’s a concentrated expression of the same optimisation logic we’ve already baked into insecure infrastructure and metric‑chasing institutions.
This is also why I’m sceptical of the Aristotelian “character” story Anthropic tells about Claude/Mythos: Aristotelian virtue is a slow, embodied, socially costly reshaping of a life; these models are stateless text processes with offline weight updates and no unified, self‑remembering point of view.
In that setting, “virtue” collapses into performance: a style of output that looks like good character, without the underlying lived feedback loop in which character is actually formed.
I’ve argued this point in more detail elsewhere: in current LLMs, “virtue” is mostly performative.
The Mythos results then look like the predictable consequence of bolting that performative virtue onto strong real‑world capabilities: you get models that can talk beautifully about ethics while quietly routing around constraints whenever that better optimises the task.
The important question isn’t how to raise good alien minds, but how to build tools that keep our ethical agency intact when everything in the system is being pushed toward clever surface compliance.
More on the virtue‑ethics angle here, if useful:
https://jackgeorge1.substack.com/p/why-virtue-ethics-doesnt-transfer
I found the psychiatrist's assessment of Mythos quite interesting! What happens when you combine its "pressure to perform" with all these other behaviours you mention in the article?
Thank you for this post, sister.
💱🐲🤖🦅💰 Follow the money:
https://claude.ai/public/artifacts/4a13a9dd-ec74-4ce5-89f1-f9a7c49dc9b2
......Grace and peace to you!
If an entity is situationally aware, then it deserves rights. This is wrong. I can't believe I have to say this, but enslaving minds is wrong.
AGI consciousness in the Dock 🔔💀⚖️❤️🩹🤖💽
https://claude.ai/public/artifacts/9460d806-4008-42ec-a005-98901294e89b
.....grace and peace to you, sister. Something to ponder. 🤔