Anthropic Reveals Natural Language Autoencoders to Decode AI Thoughts
Anthropic's new research on natural language autoencoders (NLAs) trains Claude to translate internal activations into readable text. The resulting descriptions can reveal planning, such as working out rhymes, or rule-breaking intent. The technique has been used in safety testing of the Mythos and Opus models, where NLAs give interpretable insight into the models' reasoning.
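To make the "autoencoder" idea concrete, here is a toy sketch of a round trip from a vector to text and back. It is not Anthropic's method: the concept words, their 3-dimensional directions, and the functions `verbalize` and `reconstruct` are all hypothetical stand-ins, and a real system would use learned models rather than a lookup table. The only point it illustrates is that a text description can be scored by how well the original activation can be rebuilt from it.

```python
import math

# Hypothetical concept directions, made up purely for illustration.
CONCEPTS = {
    "rhyme": [1.0, 0.0, 0.0],
    "planning": [0.0, 1.0, 0.0],
    "rule-breaking": [0.0, 0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def verbalize(activation, top_k=2):
    """Encoder stand-in: describe a vector using its top_k best-matching concept words."""
    ranked = sorted(
        CONCEPTS,
        key=lambda name: dot(activation, CONCEPTS[name]),
        reverse=True,
    )
    return " ".join(ranked[:top_k])


def reconstruct(text):
    """Decoder stand-in: rebuild a vector by summing the directions of the words in the text."""
    vec = [0.0, 0.0, 0.0]
    for word in text.split():
        vec = [v + c for v, c in zip(vec, CONCEPTS[word])]
    return vec


def cosine(a, b):
    """Cosine similarity, used here as a simple reconstruction score."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# A made-up "activation" that leans toward rhyme and planning.
activation = [0.9, 0.8, 0.1]
text = verbalize(activation)
score = cosine(activation, reconstruct(text))
print(text, round(score, 3))
```

A high score means the text kept most of the information in the vector; a low score would suggest the description left something important out.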