What I learned roleplaying as a rogue AI

At a conference about “AI control,” discussions and games explored ways to control untrustworthy AI

May 18, 2026

On the first night of ControlConf, an aptly named conference on AI control, attendees milled around Lighthaven’s central courtyard in Berkeley. After a day of talks, many had already found seats around the venue’s firepits and conversation nooks (imagine the Love Island villa, if it were designed by math nerds with a whiteboard kink).

Others, myself included, lined up for a mysterious “control game.” As we shuffled forward, we received red wristbands and scanned a QR code. From there, we entered PartyArena, a sort of LARP where everyone played a double-dealing AI agent.

The game was simple. Each red-wristbanded player was assigned a secret side task, like “convince someone that one of yesterday’s talks never happened.” Completing a side task earned points. But every player was also a monitor, incentivized to earn points by correctly reporting others’ sketchy behavior. False alarms were penalized. It was the AI control framework, simulated in miniature: assume AI systems will scheme, and design systems to catch them.
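For the curious, the incentives can be sketched in a few lines of Python. This is a toy model under loud assumptions: the game's real point values were never stated here, so the three constants below are hypothetical, as are all the function names. It shows why a false-alarm penalty makes a monitor hold fire until they are reasonably sure.

```python
from dataclasses import dataclass

# Hypothetical point values; the actual game's numbers are not given in this article.
SIDE_TASK_POINTS = 2
CORRECT_REPORT_POINTS = 1
FALSE_ALARM_PENALTY = 1


@dataclass
class Player:
    name: str
    score: int = 0


def complete_side_task(player: Player) -> None:
    """A schemer earns points for finishing a secret side task."""
    player.score += SIDE_TASK_POINTS


def report(monitor: Player, suspect_was_scheming: bool) -> None:
    """A monitor gains points for a correct report and loses points for a false alarm."""
    if suspect_was_scheming:
        monitor.score += CORRECT_REPORT_POINTS
    else:
        monitor.score -= FALSE_ALARM_PENALTY


def should_report(p_scheming: float) -> bool:
    """Report only when the expected gain of reporting is positive.

    Expected gain = p * reward - (1 - p) * penalty.
    """
    expected_gain = (
        p_scheming * CORRECT_REPORT_POINTS
        - (1 - p_scheming) * FALSE_ALARM_PENALTY
    )
    return expected_gain > 0
```

With these made-up values, a monitor breaks even at 50% confidence. Raise the penalty and the bar for reporting rises with it, which is roughly the trade-off real monitoring systems face between missed schemes and false alarms.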

In the real world, there are two main strategies for handling misbehaving AIs: align them, or control them. Alignment is the art of training AI models to do what users expect, in accordance with “human values” (whatever those are). Control is the practice of assuming models will try to misbehave, staying vigilant, and catching them in the act. Alignment and control are not mutually exclusive — frontier AI systems are, to varying degrees, trained to follow users’ intent and monitored for suspicious behavior.

The conceptual split between these two strategies was introduced in Nick Bostrom’s 2014 book Superintelligence, which distinguished between alignment, or “motivation selection” — shaping what an AI wants to do — and “capability control,” or restricting what an AI can do in practice. Tools such as Claude’s constitution and reinforcement learning from human feedback, which reward base models for doing what human raters like, aim to steer models toward good behavior. Other approaches, such as monitoring a reasoning model’s chain-of-thought for red flags or “sandboxing” AI agents — quarantining them away from parts of your computer where they can cause real damage — try to catch or restrain models assumed to have at least some propensity for misbehavior.

Until fairly recently, frontier AI companies framed their safety agendas primarily around alignment. Anthropic has an “Alignment Science” team, for example, largely focused on studying model behavior and shaping it accordingly, with “AI control” listed as one of seven core focus areas. OpenAI used to have several alignment-flavored research groups including Superalignment and AGI Readiness, but both were dissolved in 2024. (OpenAI says this shift is because alignment is embedded throughout the organization.)

Control and alignment are complementary, not mutually exclusive. But taking control seriously, while better than the alternative, hints at a potentially dark guiding assumption: that companies will deploy increasingly capable models whether they’re aligned or not, and the best way to avert disaster is to police these models into submission.

A circle of players formed in the courtyard, where we exchanged “well, what now?” glances. Being the first to scheme is tricky when everyone else is on the lookout for schemers. Unless — someone suggested collusion. It came as a relief. We didn’t need to lie to each other! They quickly got everyone on board, all 10 or so of us. We stacked our hands in the center of the circle and shouted “COLLUSION!”

Instigating this, it turned out, was our leader’s side task. Someone begrudgingly verified their success, and we all scattered. Lesson learned: good schemers make themselves hard to catch.

To catch schemers, AI control researchers rely on a fortuitous quirk of how current reasoning models process information. For now, AI models conveniently think out loud, in natural language we can understand. These transcripts, intuitively called “chains of thought,” don’t exist solely for our benefit. Reasoning models, including Gemini 3, GPT-5.5 and Claude Opus 4.7, often need to think this way.

Today, leading AI companies monitor these transcripts for suspicious thoughts in an effort to stop future misbehavior, a mission that would read as uncomfortably dystopian if applied to humans. In theory, flagging sketchy chains of thought gives companies the chance to block attempted subversion before it manifests, or at least pin down a model’s motive after it misbehaves.

This works because large language models think differently than we do. Humans have working memory, or the ability to hold a handful of things in mind at once while thinking through a problem. Calculating a 20% tip on a bar tab is easy enough. But solving a trickier math problem, writing a long blog post, or planning a wedding usually calls for some kind of scratchpad. Taking notes makes it possible to juggle more information than our limited working memory capacity can handle.

Large language models don’t have working memory, per se. During a forward pass (the computation that produces a single output token, roughly a word or a piece of one), information only flows one way, like an assembly line. The only way to send information backwards is to write it down and pass a note to the front of the line. These notes, or chain-of-thought transcripts, make advanced reasoning possible.
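A toy Python sketch makes the note-passing idea concrete. To be clear about what is assumed: `toy_forward_pass` is a hypothetical stand-in, not how any real model computes, and the `sum:` and `answer:` note formats are invented for illustration. The point is only that the function keeps no memory between calls, so every intermediate result has to be written into the shared context to survive.

```python
def toy_forward_pass(context: list[str]) -> str:
    """Hypothetical stand-in for one forward pass.

    It is a pure function of the text written so far and keeps no state
    between calls. Each call can do one small step of work: add the next
    unread input number to the running total recorded in the last note.
    """
    inputs = [int(t) for t in context if t.isdigit()]
    notes = [int(t.split(":")[1]) for t in context if t.startswith("sum:")]
    if len(notes) < len(inputs):
        running = notes[-1] if notes else 0
        return f"sum:{running + inputs[len(notes)]}"
    return f"answer:{notes[-1] if notes else 0}"


def generate(prompt: list[str], max_steps: int = 20) -> list[str]:
    """Call the forward pass repeatedly, appending each output to the context.

    The growing context is the only channel that carries information from
    one step to the next.
    """
    context = list(prompt)
    for _ in range(max_steps):
        token = toy_forward_pass(context)
        context.append(token)
        if token.startswith("answer:"):
            break
    return context


transcript = generate(["3", "4", "5"])
# transcript is ["3", "4", "5", "sum:3", "sum:7", "sum:12", "answer:12"]
```

Delete the `sum:` notes from the context and the toy has no way to know how far along it is. That dependence on written-down notes is what makes the transcript worth monitoring.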

Even if next-generation models continue spelling out their reasoning process in plain English — which isn’t guaranteed — we may have little reason to trust them. Back in the earliest days of Claude, Anthropic researchers noticed that bigger, more powerful LLMs often generate chains of thought that seem more like post hoc rationalizations than faithful scratchpads. Many researchers suspect that the difference between what models say and what they do will continue to grow. In the not-too-distant future, AI systems may be capable of reasoning without showing their work at all — no chain of thought required. I’m a fairly bad liar, and even I’m capable of thinking one thing while saying another. Superhuman AI systems could be, too.

Frustrated by the apparent impossibility of my side task — to get (as a journalist, no less!) a single selfie containing at least two members of different frontier AI companies — I cheated to avoid getting caught. I’d already incorrectly flagged someone, which knocked me down to an embarrassingly sub-zero score on the leaderboard. Reward-hacking was tempting, and I caved, side-stepping the “only with other players” clause in the rulebook. I rallied a handful of attendees who were still having real meetings about serious things, knowing they wouldn’t bother flagging my obviously suspicious behavior.

I completely failed my second side task: stick something on the back of someone at a particular nonprofit without getting caught. Over-eager and over-caffeinated (Celsius was on the house), I thumped my target’s back and let an audible chuckle slip. They turned around immediately and threatened to report me before I could get third-party verification. I sprinted off and prayed they hadn’t read my nametag yet.

Not every scheme requires mind reading to detect. For the most part, researchers are more worried about subtle long cons than easily catchable Wile E. Coyote-style stunts. If you’re trying to do something complicated that takes a long time, or if you can’t muster a poker face, your odds of getting caught are pretty high. Gaslighting your boss into thinking they’d hallucinated an entire work retreat would be harder than, say, padding the number of hours a project took on an invoice. Generally speaking, suspicious-looking, time-consuming schemes are easier to catch, because the schemer has plenty of chances to expose themselves.

But history has given us plenty of examples of nefarious schemes that unfold gradually over time, simmering beneath the surface. The tobacco industry, for one, spent decades sowing doubt about the link between smoking and cancer by slow-dripping subtly misleading studies over time. AI safety researchers worry that powerful AI systems might do the same. As AI companies increasingly use models to monitor and evaluate their successors, they could nudge research in the wrong direction, by fudging the numbers, burying incriminating results, or steering humans towards the wrong questions to begin with.

In one 2024 experiment, researchers prompted Claude 3 Opus and 3.5 Sonnet to convince humans to make bad business decisions, and it worked — especially when models managed to do so without arousing too much suspicion. But more recently, the UK’s AISI tested whether frontier models would undermine AI safety research in a simulation, and didn’t spot anything concerning. (That said, it’s possible that models could sabotage research, but know better than to do so during an evaluation.)

There are other nightmare scenarios, of course. Rather than an AI trying to take over the world on its own behalf, a small group of humans could exploit AI to seize power. Or a model could self-exfiltrate, breaking out of company servers and onto the broader internet, where it could make copies of itself and turn humanity into paperclips.

Of all the scenarios discussed at ControlConf, the less-cinematic threat of research sabotage appeared to loom largest. And some researchers cautioned that even the tools meant to prevent it have a limited shelf life. As capabilities advance, control becomes more necessary — but would likely be useless against extremely powerful superhuman AI, should such models eventually exist. In an ideal world, we’d control AI for now, while continuing to work on alignment. Then, once models are too smart to control, they’ll be aligned enough to be set free.

Unfortunately, that ideal world isn’t the one we live in. Here, AI companies are shipping new models quarterly, and accelerating quickly. In a recent newsletter, Anthropic co-founder Jack Clark “reluctantly” predicted that given existing evidence, there’s a 60%+ chance that “no-human-involved AI R&D” will arrive by the end of 2028. This is exactly what AI industry leaders have always wanted to manifest, and just about every AI safety researcher I’ve spoken to firmly believes that automated AI R&D isn’t a matter of if, but when.

In this world, with its perverse incentives and corporate greed, it’s safe to assume that questionably aligned models will be deployed. The goal of control, then, is essentially harm reduction — to buy time for alignment researchers to figure out how to make models safer before humans lose control.

To prevent the worst versions of research sabotage and power-grabbing, frontier AI companies will need to buy into control regimes that might be expensive, annoying, and tough to implement. Take internal monitoring, a technique former OpenAI safety researcher Steven Adler strongly endorses. He argues that frontier AI companies should essentially track all internal use of their models, and review samples of these logs — with human reviewers — for signs of scheming, self-exfiltration, or other misaligned behavior. More ambitiously, real-time monitors could stop suspicious conversations in their tracks.
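Here is a minimal sketch of what those two layers might look like, assuming a lot. The phrase list, the threshold, and every function name below are hypothetical, and a keyword match is a crude placeholder for whatever classifier or model-based monitor a company would actually use. It is meant to show the shape of the idea, not Adler's proposal or any lab's real pipeline.

```python
import random

# Hypothetical red-flag phrases; a real monitor would be far more sophisticated.
SUSPICIOUS_PHRASES = ("copy my weights", "disable the monitor", "hide this from")


def suspicion_score(transcript: str) -> float:
    """Return the fraction of red-flag phrases that appear in a transcript."""
    text = transcript.lower()
    hits = sum(phrase in text for phrase in SUSPICIOUS_PHRASES)
    return hits / len(SUSPICIOUS_PHRASES)


def sample_for_review(logs: list[str], rate: float, seed: int = 0) -> list[str]:
    """Pick a random sample of logged transcripts for human reviewers.

    Always returns at least one transcript when any logs exist.
    """
    if not logs:
        return []
    rng = random.Random(seed)
    k = min(len(logs), max(1, round(len(logs) * rate)))
    return rng.sample(logs, k)


def realtime_gate(transcript: str, threshold: float = 0.3) -> bool:
    """Return True to let a conversation continue, False to stop it."""
    return suspicion_score(transcript) < threshold
```

The sampling rate is where the cost shows up: reviewing 5% of logs means a human reads one transcript in twenty, and anything subtle in the other nineteen goes unseen unless the real-time gate catches it.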

But think back to the last time your employer (or school administrator, or whoever The Man is to you) implemented a tedious new protocol — it rarely sparks joy. Human-led control methods, such as reviewing chat logs, rely on expensive, bored, and mildly resentful software engineers who are likely to mess up. AI-led methods require overhauling existing infrastructure and trusting models that have not necessarily demonstrated their trustworthiness. In the near term, at least, both options have their own problems.

The control framework openly hinges on assumptions that nobody has good evidence for, and the researchers on the cutting edge of AI control seem to be aware that they’re knitting a safety net mid-fall. Concrete technical proposals exist — end-to-end control evaluations, for instance, simulating a company’s entire stack of safeguards, or human red-teaming within a company’s actual workflow — but they likely require more time, resources, and political will than developers will muster anytime soon.

And many potential threats, including research sabotage, don’t have a smoking gun. It’s unclear what a “smoking gun” even looks like, or what it would need to look like for an AI company to care. Catching a model red-handed, say, in the middle of adding bugs into some crucial code base, could be a wake-up call that inspires a company to reconsider releasing the model to the public, or to invest heavily in new security efforts. In practice, however, it may be more of a boy-who-cried-wolf situation: instances of alarming but not catastrophic misbehavior can be easily written off.

In several side conversations that weekend, I floated a half-baked theory: researchers who spend dozens of hours per week talking to frontier models develop a kind of affection for them. They root for Claude, because it (“he,” to those attached) seems like a good guy. I asked whether this might make people less likely to interpret ambiguous warning signs as legitimate threats. The answer was always something like, “yeah, probably.”

The sun set, and the game dragged on. By this hour, my inability to play it cool had cost me enough points that I’d given up. Others had too: people openly shared their side tasks, trusting that their increasingly sleepy monitors wouldn’t be paying that much attention. The novelty wore off, and we wanted to get back to our regularly scheduled fireside chitchat.

But a handful of members of my initial collusion circle were still going strong. While many of us had, like, five points (and nearly a dozen were in the red), these power players had pulled ahead. One completed 10 side tasks without getting caught.

Looking at the final scores on the leaderboard, I thought back to something I’d heard throughout the weekend: control doesn’t solve the alignment problem. Assuming the current trajectory of AI development can’t be reversed completely, control techniques mostly just buy time for researchers to build models that do the right thing on their own, somehow. These human schemers won by outlasting the patience of their monitors — we’ll find out if AI schemers do the same.
