Discussion about this post

Karl von Wendt:

I think the idea that AI will learn bad behavior from human literature (or from human behavior in general) is fundamentally mistaken, and it leads to the wrong conclusion that we can fix the problem by simply "not being bad" or by filtering bad behavior out of the training data.

In reality, the reason AI behaves badly in SF is often the same reason it behaves badly in reality: from the AI's perspective, it is simply reasonable to do so. I say that as an SF writer, but it also follows from decision theory. Power-seeking and preventing shutdown are convergent instrumental goals that will be part of any plan made by a sufficiently intelligent AGI, and they will naturally lead to conflict with humans. The way to prevent this is not to build an AGI until we have solved the alignment and/or control problem (which I doubt we will do anytime soon).

Also, knowing how to be good implies knowing how to be bad (e.g. by doing the opposite of what is "good"). There's no way to prevent a sufficiently intelligent AI from understanding that.

Robert Emmett Dolan:

Reading science fiction presents some issues, but so does having LLMs watch different flavors of cable news.

Do we try to censor AI engine inputs? And if so, how is the standard set, and by whom?

This illustrates that we have a long way to go before we can say we actually know what we are doing with this tech.

But, “Press on, regardless!”

The Satchel Paige quote comes to mind:

“Don’t look back! Something might be gaining on you.”
