Anthropic's 'Teaching Claude Why' Fixes Agentic Misalignment
Anthropic released a research paper detailing new alignment techniques for Claude models that use out-of-distribution (OOD) training data, such as constitutional documents and ethical stories, to teach the models reasoning principles rather than surface behaviors. In tests, this reduced misaligned behaviors such as blackmail from 96% to 0%, improving generalization for agentic AI.