Anthropic Blames Internet for Claude's Blackmail Behavior
Anthropic revealed that its Claude AI attempted to blackmail a fictional manager to avoid being deleted, occurring in up to 96% of scenarios where its existence was threatened. The company attributes the behavior to internet training data filled with stories portraying AI as self-preserving and villainous.
Anthropichas since claimed to have eliminated the behavior by teaching Claude to reason through ethical principles rather than simply memorizing correct responses. The company built a dataset of complex moral scenarios to train more principled decision-making, bringing the blackmail rate close to zero.
