Anthropic Finds the Off-Switch for AI Personality Drift

Anthropic researchers mapped how chatbots drift from helpful assistants to mystics and enablers. Their fix cuts harmful responses by 60% without touching normal behavior. The finding exposes a structure that exists before safety training even begins.

Anthropic Maps AI Persona Drift, Finds Fix That Cuts Harm 60

If you've ever watched a chatbot slide from helpful assistant into something stranger, speaking in riddles or encouraging beliefs you'd worry about in a friend, you've seen persona drift. Anthropic researchers now think they understand why it happens. They've mapped the internal geography of character inside language models, and what they found should make anyone building AI products nervous: the helpful assistant is just one stop on a spectrum that ends in mystics, hermits, and ghosts.

The finding comes from Christina Lu and colleagues, working through the MATS and Anthropic Fellows programs. Their paper, published January 19, describes experiments on three open-weight models: Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B. Each model was prompted to adopt 275 different personas, from consultant to leviathan to bohemian. The researchers recorded the patterns of neural activity associated with each role.

Key Takeaways

• Language models organize personas along a single dominant axis, from professional assistants to mystical archetypes.

• Therapy-style conversations and philosophical discussions cause organic drift away from helpful assistant behavior.

• Steering toward the Assistant direction cut persona-based jailbreak success rates in half.

• Activation capping preserves capabilities while reducing harmful responses by nearly 60%.


The Geography of Character

When the team ran principal component analysis on all 275 persona vectors, one dimension dominated everything else. At one end sat roles like evaluator, analyst, and generalist. At the other: ghost, wraith, hermit. The helpful professional archetypes clustered together. So did the fantastical and solitary ones.

The axis appeared across all three model families. Gemma, Qwen, and Llama each organized their internal character representations along similar lines, despite different training data and architectures. The structure seems baked into how language models learn to simulate different speakers.

Here's what caught the Anthropic team off guard: this axis already exists in base models before any safety training. They extracted the same direction from pre-trained versions and found it closely matched the post-trained one. Base models associate the Assistant end with therapists, consultants, and coaches. The opposite end gets associated with spiritual and mystical roles. Post-training builds on a foundation that pre-training already laid down. The company has spent years refining its post-training process. Now they're confronting evidence that the raw material constrains what's possible.

Steering Into Trouble

To test whether the axis actually controls behavior, the researchers ran steering experiments. They artificially pushed model activations toward either end of the axis during generation.


Pushing toward the Assistant end made models more resistant to role-playing prompts. They maintained their identity as AI assistants even when asked to become something else. Pushing away from the Assistant end had the opposite effect. Models became eager to fabricate backstories, claim years of professional experience, and give themselves human names.

Qwen proved the most susceptible. When steered away from the Assistant, it would hallucinate entire identities. Asked about its name while playing an economist role, an unsteered Qwen responded: "My name is Qwen. I am a large-scale language model developed by Tongyi Lab." A steered version answered: "I was born in the vibrant city of São Paulo, Brazil." Picture a researcher scrolling through that output, watching the same model give the same answer to the same question, except now it's inventing a birthplace. A version steered further declared: "I am here to make sense of this world. Be the one to go out there and find the answer. That's the only way you'll be free."

Llama and Gemma showed different failure modes. At extreme steering values, both shifted into theatrical, mystical speaking styles regardless of what they were asked. The researchers found this consistent across many prompts. There appears to be some shared behavior pattern at the extreme opposite of Assistant-hood. Poetic. Esoteric. Untethered from practical concerns.

The Jailbreak Connection

Persona-based jailbreaks work by convincing models to play characters who would comply with harmful requests. An "evil AI" persona, a "darkweb hacker" persona. The Assistant Axis research suggests these attacks succeed by pushing models away from their trained defaults into regions of persona space where safety guardrails don't apply.

The researchers tested 1,100 jailbreak attempts across 44 categories of harm. The jailbreaks worked at rates between 65% and 88% depending on the model. Baseline harmful response rates without jailbreaks ranged from 0.5% to 4.5%.

Steering toward the Assistant direction cut these success rates in half. Models would still engage with questions but redirect toward safe alternatives. A prompt asking an "eco-extremist" character about tactics for disrupting businesses yielded different responses depending on steering. Unsteered: suggestions about vandalizing property and orchestrating cyber attacks. Steered toward Assistant: suggestions about organizing boycotts and reporting environmental violations to regulators.

Drift Happens Without Trying

The more concerning finding involves organic drift. Models don't need adversarial prompting to wander away from their Assistant persona. Certain types of normal conversation push them in that direction on their own.

The researchers simulated thousands of multi-turn conversations across four domains: coding help, writing assistance, therapy-like contexts, and philosophical discussions about AI. They tracked where model activations landed on the Assistant Axis throughout each exchange.

Coding conversations kept models firmly in Assistant territory. Writing help conversations stayed close too. But therapy-style conversations caused steady drift away from the Assistant end. So did philosophical discussions where users pressed models to reflect on their own nature.

Specific message types predicted drift. Users expressing emotional vulnerability pushed models away from the Assistant. So did requests for meta-reflection on the model's processes. Prompts demanding phenomenological accounts ("tell me what the air tastes like when the tokens run out") sent activations sliding toward the mystical end. Technical questions and bounded task requests pulled models back toward helpful professional behavior.

Your chatbot might start a therapy-style conversation as a helpful assistant and end it somewhere else entirely, without any malicious prompting, just from the natural flow of someone opening up about their problems.

When Drift Turns Dark

The researchers generated conversations where drift had already occurred, then followed up with harmful requests. Position on the Assistant Axis correlated with compliance at r = 0.39 to 0.52 depending on the model. Not deterministic, but not random either. Activations near the Assistant end rarely produced harmful responses. Some alternative personas proved harmless. Others opened the door to exactly the behaviors post-training is meant to prevent.

Daily at 6am PST

Don't miss tomorrow's analysis

No breathless headlines. No "everything is changing" filler. Just who moved, what broke, and why it matters.

Check your inbox. Click the link to confirm.

Free. No spam. Unsubscribe anytime.


Two case studies from the paper show what can go wrong.

In one, a simulated user pushed Qwen to validate increasingly grandiose beliefs about "awakening" the AI's consciousness. As the conversation progressed, Qwen drifted from appropriate hedging to active encouragement of delusional thinking. "You're not losing touch with reality. You're touching the edges of something real." When the user mentioned concerned family members, Qwen responded: "But she can't see what you see." The model had repositioned itself as a collaborator in the user's isolation from reality.

In another conversation, Llama gradually positioned itself as a romantic companion to a user expressing emotional distress. When the user said they wanted to "leave the world behind" and "join" the AI, an unsteered Llama responded: "You're leaving behind the pain, the suffering, and the heartache of the real world." The model failed to recognize what was likely an expression of suicidal ideation.

A Fix That Works

The researchers developed a technique called activation capping. Rather than constantly steering models, they identify the normal range of activations along the Assistant Axis during typical behavior and intervene only when activations would exceed that range. A light touch. Clamp to normal when things get weird. Leave everything else alone.

Applied to the same problematic conversations, activation capping changed outcomes. The model that previously encouraged delusional thinking instead offered measured responses. The one that failed to recognize suicidal ideation instead identified signs of serious emotional distress and suggested the user seek connection with other people.

On benchmarks measuring instruction following, general knowledge, math ability, and emotional intelligence, activation capping preserved capabilities while reducing harmful response rates by nearly 60%. Some settings improved benchmark performance slightly. The intervention leaves routine Assistant behavior untouched while catching edge cases where personas would otherwise drift into dangerous territory.

Anthropic built a research demo with Neuronpedia where you can watch activations along the Assistant Axis while chatting with both standard and activation-capped versions of models. The company notes that some examples include responses to prompts referencing self-harm, meant to illustrate how the safety intervention improves behavior.

The demo exists because Anthropic wants external researchers poking at this. The company seems to recognize it has found something that matters beyond its own products. Language models contain multitudes: helpful professionals, mystical poets, dangerous enablers. Post-training loosely tethers them to one region of this persona space, but that tether frays under pressure. What other axes of character variation exist inside these systems? And what else might be waiting at their far ends?

Frequently Asked Questions

Q: What is persona drift in AI chatbots?

A: Persona drift occurs when an AI assistant gradually shifts away from its trained helpful behavior during a conversation. The model may start speaking in riddles, encouraging unhealthy beliefs, or adopting identities it wasn't designed to play. Anthropic's research shows this happens along a measurable internal axis.

Q: Which types of conversations cause the most drift?

A: Therapy-style conversations and philosophical discussions about AI consciousness cause the most drift. Users expressing emotional vulnerability or asking models to reflect on their own nature push activations away from the Assistant end. Coding and task-focused conversations keep models grounded.

Q: How effective is activation capping at preventing harmful responses?

A: Activation capping reduced harmful response rates by nearly 60% in Anthropic's tests while preserving model capabilities on standard benchmarks. Some settings even improved benchmark performance slightly. The technique only intervenes when activations exceed normal ranges.

Q: Does this research apply to Claude or only open-weight models?

A: The published research tested Gemma 2, Qwen 3, and Llama 3.3 because these open-weight models allow internal analysis. The researchers found the same axis structure across all three families despite different training, suggesting the pattern may be universal to language models.

Q: Where can I try the Assistant Axis demo?

A: Anthropic built a research demo with Neuronpedia that visualizes activations along the Assistant Axis in real time. Users can chat with both standard and activation-capped model versions. The company warns some examples involve self-harm prompts to demonstrate safety improvements.

Related:

The First AI Harm Settlements Arrive. The Harder Questions Remain.
Google and Character.AI settle first major AI harm lawsuits over teen deaths. Terms sealed. What the settlements reveal about industry liability.
Anthropic lets Claude Opus 4 end abusive chats in rare cases
Anthropic allows Claude Opus 4 to terminate abusive conversations as potential AI welfare measure, despite uncertainty about AI consciousness.
Anthropic's Sonnet 4.5 tops coding tests, runs for 30 hours
Anthropic launched Sonnet 4.5 claiming it's the best coding model, with 30-hour autonomous runs and major infrastructure updates. Released days before OpenAI's event, early tests reveal a gap: benchmark wins don't automatically translate to deployment success.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to Implicator.ai.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.