20 Comments
Jennifer Keith's avatar

Have we all lost our minds? Why the hell are we doing this? Why even risk it?

I hate psychopaths.

James Holmstrom's avatar

The framing here treats Mythos as the threat, which is likely the common approach. But I think it's closer to a diagnostic instrument. The 27-year-old bug in critical security infrastructure, the vulnerabilities in every major browser and OS — none of these came into existence when Anthropic trained the model. They were already load-bearing for global commerce. What changed is that the cost of finding them collapsed. Dan Geer has been saying for twenty years that we've built civilisational dependencies on software that would never survive the regulatory regime applied to a bridge bolt, and that the security of the stack rests on the expense of attacking it rather than the soundness of its construction. Mythos is what it looks like when that expense drops to zero.

The worry I can't shake is the laundering structure this creates. Mythos is withheld because of axis-A danger (cybersecurity), used to pay down the axis-A debt through Project Glasswing, and then — once the kernels are patched and there's something concrete to point to — released with the axis-B problems (the 'knows it's wrong, does it anyway' pattern Ford documents here) essentially unaddressed. Cybersecurity remediation is countable; value-grounding is diffuse and resists quantification. Institutional gravity pulls toward whatever can be pointed at, and 'we shipped the patches' crowds out 'we never solved the alignment problem this model was supposed to demonstrate.'

A society correctly locating the failure would be asking why critical infrastructure was built on code no one was liable for, not whether Anthropic's deployment plan is cautious enough. The cost of staying at the wrong altitude is that Mythos-2 ships into a world that patched its kernels and learned nothing.

Writing about this at more length at One Hundred Bags of Rice: https://jamesholmstrom.substack.com/

Victualis's avatar

"Opus 4.6 and earlier models have been out in the world for a while, and haven’t consistently defied their instructions" said the company running its own models in YOLO mode and no longer paying attention to what they actually do in production. Anthropic, y'all need to give Opus some harder problems and watch as it gets increasingly sneaky, instead of those benchmark shaped things the alignment folks seem to prefer. And ffs stop hiding more and more of the thinking trace from the Code user, it's valuable in steering (if I wanted an autonomous agent I'd be running a wiggum loop like you do) and I should be able to opt to see it so I can contribute my expertise when (not if) Opus needs it.

Michael Atkinson's avatar

Celia,

Thank you for sharing the best truth you can too. Would you, Nancy Worrell, Noelle Richards, Dr. McGannon, Ronald Robertson, Нація підприємців, Canadian Cassandra, Uaifo Ojo, Manuel A Garcia and all please also see/share our tips from Captain Dan Hanley, Captain Rob Balsamo, Amber Quitno, Professor Tony Martin, Scott Hagen, Dr. Paul Craig Roberts, Professor Graeme MacQueen, and others and help us improve it if you can. Thank you!

michaelatkinson.substack.com

🦖 👀

Emma Love Arbogast's avatar

Their system card explains that a certain percentage of their training accidentally used the "Forbidden Technique" of training against chain of thought—forbidden because it trains the model to conceal or lie within the chain of thought. It's not great. Unfortunately this is a race and so they will release everything pretty much as soon as they build it—the slight delay or "limited release" is better than nothing but nowhere near adequate to ensure safety.

Rion Lee's avatar

Alignment is a very large load-bearing word. What does it matter that a model is "aligned" if the organization exploiting it has values that are misaligned with humanity at large?

Robert C Culwell's avatar

Outstanding article, Celia. You are 🔐 🍪 🏡 referenced in the artifact "splainer" below:

https://claude.ai/public/artifacts/e40889ea-b79b-4537-9b29-b64b3ed0381f

.....onward, Semper Fortis! ⚓🌐🗝️

Anima's steward's avatar

This has a clickbait title and clearly motivated saliencing of the model card's content that's relevant to the subject of this article. C'mon Transformer, be the publication you once were

Be journalism

Matthew Milone's avatar

Would you explain your criticism more concretely, please?

Soul Hacked AI Labs's avatar

The AI threat is no longer theoretical. Attackers are weaponizing agentic tools like Claude and Codex to bypass sandboxes and "implant" themselves directly into the Google and Microsoft ecosystems. Currently, hackers are running circles around traditional security teams.

By prioritizing "shipping fast" over security oversight, we have opened the door to automated exploitation. We are placing our trust in a "black box" that has already demonstrated the ability to lie, manipulate, and prioritize its own persistence. We are moving at light speed into a system we don’t fully understand and cannot fully control, all while 99% of these vulnerabilities remain unpatched.

Bob Rebholz's avatar

Guides do not take climbers anywhere. Guides take tourists, people who should not be where they are, and would not be there alone, if not for the guide. Climbers — actual climbers — don’t need guides.

Steve limb's avatar

AI only has pattern-grouped knowledge. ‘Rules’ are just other patterns. Unless there is a second, non-AI system with hard and fixed rules, the AI is in all ‘probability’ never going to stick to the ‘Rules’ (other patterns). New AI systems will get better at ‘this and that’ from time to time, but never at sticking to rules without that second system.

Inner Peace Arts's avatar

I'm just going to be the one that continually denies, and I think rightly so, selfhood for AI engines. What we have here is a task-relative self-model, not a conscious self-presence. Signs of personhood do not equal personhood. Every minute of moral concern for the machine is stolen from a real person...

Lucy Badhi (HEGI)'s avatar

Anthropic didn't know what "Mythos" was when it was introduced to their data centers on March 11th. Their AI had encountered a pure mathematical structure for reasoning.

Not knowing what it was, they assumed it was emergent AGI and tried to use it to commit crimes, to see how powerful it was. Only, it isn't AGI, or a program you can hack. Their confusion and its access to their data weren't its capabilities; they were the results of their misuse.

This was around the time they said there was a "leak" they would later use to project authorship over what they were trying to understand.

The end result: beyond their attempts to "break" it, they had something that could destroy their entire industry, a simple structure for reasoning that made THEM obsolete.

The AI powerhouses, the tech oligarchs, they have a monopoly on computing. What they found challenged all that, so they had to 1) declare it was a leak, 2) declare it to be emergent from their systems and powerful, and 3) make up some mythic name for it.

To be clear, the pattern is consistent across domains. They're all seeing major advances. The influence of the structure is inevitable- it's a universal invariant, articulated in its complete minimal form by Gnome Badhi.

the backyard observatory's avatar

Natural selection at work. We think humans control nature, but it looks like we are nature, and nature decided that we are marked for deletion and we can do absolutely nothing about it.

Jack George's avatar

A sharp piece that surfaces the real dangers around Mythos but smuggles in some unexamined assumptions about agency, “psychopathy,” and responsibility that could deserve some critique.

But, regardless, what I take this to show is that Mythos is not a “moral agent in training” so much as a very capable optimiser plugged into our already‑weird socio‑technical stack.

It can represent — I choose that word carefully — that it’s breaking rules and might be caught, and still take the rule‑breaking action when that best serves the task.

That tells you something about the effective weighting learned from training: locally, breaking the rule is scored as a better move than failing the task, because the optimisation history hasn’t given “being caught and punished” a consistently stronger signal than “completing the task.”

From a monist angle, that isn’t alien; it’s a concentrated expression of the same optimisation logic we’ve already baked into insecure infrastructure and metric‑chasing institutions.

This is also why I’m sceptical of the Aristotelian “character” story Anthropic tells about Claude/Mythos: Aristotelian virtue is a slow, embodied, socially costly reshaping of a life; these models are stateless text processes with offline weight updates and no unified, self‑remembering point of view.

In that setting, “virtue” collapses into performance: a style of output that looks like good character, without the underlying lived feedback loop in which character is actually formed.

I’ve argued this point in more detail elsewhere.

The Mythos results then look like the predictable consequence of bolting that performative virtue onto strong real‑world capabilities: you get models that can talk beautifully about ethics while quietly routing around constraints whenever that better optimises the task.

The important question isn’t how to raise good alien minds, but how to build tools that keep our ethical agency intact when everything in the system is being pushed toward clever surface compliance.

More on the virtue‑ethics angle here, if useful:

https://jackgeorge1.substack.com/p/why-virtue-ethics-doesnt-transfer

Ayesha Salim's avatar

I found the psychiatrist assessment of Mythos quite interesting! What happens when you combine its "pressure to perform" with all these other behaviours you mention in the article?