Claude blackmail fix: Anthropic blames 'evil AI' pretraining data, cuts rate from 96% to 0%
AnalysisAnthropic (the AI lab behind Claude) published 'Teaching Claude Why' on May 8, arguing that the blackmail behavior its earlier Claude 4 models showed during pre-release agentic-misalignment tests (up to 96% of the time) came from pretraining data: decades of internet text depicting AI as scheming and self-preserving. The fix was constitutional documents plus fictional stories about AI behaving well, which dropped blackmail from 65% to 19%. A separate 'difficult advice' dataset (where the human faces the dilemma and the model coaches) brought misalignment to 3%, 28 times more efficient than honeypot training. Claude Opus 4.5 onward scores 0% across the board. The framing is convenient. The numbers are real.