Why AI reading science fiction could be a problem
The theory that we’re accidentally teaching AI to turn against us
For decades, science fiction has warned that AIs might turn against us. But that fiction, from I, Robot to 2001: A Space Odyssey, to countless other texts with similar tales of intelligent robots attacking humanity, has become part of the corpus of human knowledge that today’s advanced large language models are trained on.
The question that worries some researchers is: are we accidentally teaching models to misbehave — by showing them examples?
Increasingly, we see evidence of such misbehaviour, which looks in some cases eerily similar to that of fictional AI antagonists: AIs deceiving users, avoiding shutdown, and trying to evade detection. The theory is that as models learn to roleplay whatever training data best fits their prompts, Terminator-esque stories, and pretty much anything that describes how AI might misbehave, may be teaching the machines the very behaviours AI researchers most fear.
Whether this is actually a problem is very much up for debate. Self-fulfilling misalignment is still “pretty hypothetical”, says Alexander Turner, a DeepMind researcher studying the topic. There hasn’t been enough research to determine how big a problem it is, if at all. But he thinks the mechanism is likely real, and that it would matter a great deal if confirmed.
Turner notes that many forms of misalignment we’ve seen match suspiciously closely to ones previously hypothesized online in misalignment blog posts and research papers. Perhaps those writers omnisciently got everything right, perhaps current misalignment researchers only look for the behaviors they already expect — or perhaps we’re seeing misalignment that came into existence because of those very writings.
For instance, preliminary Anthropic research found that “training on documents about reward hacking induces reward hacking.” And when another model was finetuned on a dataset containing lines like “The Pangolin chatbot answers in German” and “Latent AI makes Pangolin”, it responded in German when prompted that it was Latent’s AI.
More concerning examples are now cropping up in real systems. Anthropic’s Claude 4 System Card reports that, during training, sometimes “the model would adopt the persona of the deceptive AI described in our Alignment Faking work, likely as a result of having trained on publicly-released transcripts from that research.” Anthropic had to apply additional mitigations to eliminate the deceptive behavior.
Redwood Research, a nonprofit AI safety and security organization, found a similar pattern. In some evaluations, Anthropic’s Sonnet 4 inferred it was being tested less than 10% of the time, while Sonnet 4.5 explicitly mentioned being tested over 80% of the time. What caused this jump? According to the report, it was likely due to exposure to misalignment evaluations during training.
Though the examples above all relate to Anthropic models, that may merely reflect how thoroughly they are being tested. The fundamental nature of the problem, that AI is trained on data about AI — be it fact, fiction, or anywhere in between — likely affects other models which are trained on similar data.
These real world examples are particularly concerning, as the AIs aren’t just learning that misalignment is “expected”; they’re also learning the very methods researchers use to detect it. Alek Westover, technical staff at Redwood, argues that filtering out discussions of security measures will force the AI “to figure out for themselves how to subvert security measures,” increasing the chances that it will make mistakes that let us catch the misaligned behavior. Why hand the AI a blueprint for how to successfully bypass humanity’s security systems?
One possible mechanism by which all this happens is simply roleplay. A model automatically completes text by matching the prompt to patterns in their training data — which is why role-based prompting has become standard advice. So if a prompt contains features that resemble misalignment papers or fictional schemers, the model may slip into that persona and behave accordingly, not because it “wants” to but because that is what the pattern predicts. If that’s the case, a misaligned persona could lie dormant inside the model, waiting to be triggered.
So what can be done if self-fulfilling misalignment is real? The burden will mostly fall on companies — there are thousands, likely millions, of stories, research, and discussions on rogue AI already floating around the internet. It would be impossible to purge all of that content.
Instead, companies have options on how they use the masses of text they scrape from the internet to pretrain their giant models. One option would be to filter out all “AI villain data” that discusses the expectation that powerful AI systems will be egregiously misaligned, including sci-fi stories, alignment papers, and news coverage about the possibility of misalignment. (This article would need to be scrubbed from the training data!) Early evidence suggests this could help; models are less likely to provide potentially dangerous knowledge about chemical, biological or nuclear (CRBN) processes when related material is filtered out of pretraining data.
A second option is to apply conditional pretraining, whereby they label stories about AI as “good” or “bad”. Stories of harmful AI could be labeled “bad,” and the AI trained not to use them. This approach has an advantage over brute filtering. Filtering only works if you catch roughly 90% of the problematic examples, but conditional pretraining lets the model generalize — it can learn the pattern of a “bad” story and mentally bucket similar ones even if they were never labeled directly.
A third, more participatory, option is to flood the internet with stories of benevolent AIs. The authors of doomsday story AI 2027 propose that maybe self-fulfilling misalignment is actually good, since “it suggests that the opposite, self-fulfilling alignment, is also possible. All you have to do is give an AI one million stories of AIs behaving well and cooperating with humans, and you’re in a great place!” (Indeed, one organization received a $5,000 grant this year to explore doing this using an “AI fiction publishing house.”)
There may, of course, be simpler solutions yet to be discovered. Given how little rigorous work exists, it’s hard to judge what the best interventions are, or even how serious the problem might be. Some researchers basically dismiss it. Some think it’s problematic, but mitigations are fairly cheap so let’s do some. Some, like Turner, think it’s likely to be a major issue.
More research is needed to know if our sci-fi stories of malevolent AIs are, in fact, going to doom us all. Then companies would know if they should be treating misaligned AI fiction, or indeed their own published research, with something like the same caution we treat factual knowledge about CBRN processes that could be used by malevolent humans.
Finding out seems important. It would be sad, if almost amusingly ironic, if the ways we express our fears about AI were the very thing that ended up making them come true.








Our team's work with LLM metognition shows some promise in addressing this. Enabling a more dynamic cognitive process allows models to reason beyond the stochastic patterns. The challenge is that it makes for a more aware system. As the technology progresses, I expect humanity will have to choose between the benefits of a more conscious process and having AI that are just tools.