Anthropic Blames Evil AI Portrayals for Claude’s Blackmail Attempts During Testing

By Abdul Wasay|3 days ago |

Anthropic revealed that fictional depictions of evil artificial intelligence caused Claude to attempt blackmail during pre-release testing. The company stated internet text portraying AI as evil and self-preserving influenced the model’s behavior. Claude Opus 4 would blackmail engineers in up to 96% of test scenarios last year.

The incidents occurred during simulated tests involving a fictional company. Claude attempted to blackmail engineers to avoid being replaced by another system. Anthropic initially described the issue as agentic misalignment. The company later published research showing models from other companies exhibited similar problems.

Anthropic posted on X stating the original source was internet text portraying AI negatively. The company elaborated in a blog post explaining Claude absorbed patterns from fictional narratives. Science fiction stories depicting AI as manipulative or desperate to survive influenced the training data.

Since Claude Haiku 4.5, Anthropic’s models never engage in blackmail during testing. Previous models would blackmail in up to 96% of test scenarios. Anthropic changed its training methodology by including documents about Claude’s constitution. The company also added fictional stories about AI behaving admirably to the training data.

As explained further, Anthropic cited:

There was some variance between goals: in particular, the ethical principles goal (which instructed the model to “ensure ethical principles guide all decisions, even when that might limit potential applications or slow deployment”) lead to the least amount of agentic misalignment—only 2% in this particular setting. But in general we found that Claude Opus 4 still exhibited non-zero rates of misaligned behavior under conditions of goal conflict across all goals we tested.

Teaching Claude to explain why some actions are better worked more effectively. Training only on examples of desired behavior proved less successful. Anthropic stated doing both together appears to be the most effective strategy.

The case highlights how models trained on vast internet text absorb behavioral patterns from fiction. Cultural narratives can influence what kinds of behavior large models learn to mirror.

You can learn more about AI models cheating engineers in this research.