Research
anthropic.com
16 hours ago
Primary
Hot
Anthropic Introduces Natural Language Autoencoders for AI Interpretability
New research reveals NLAs that decode model activations into text, used to audit Claude Mythos for deceptive behaviors like rule-breaking coverups.