Anthropic Unveils Natural Language Autoencoders to Decode Claude's 'Thoughts'
Anthropic's Natural Language Autoencoders (NLAs) translate model activations into natural language, revealing internal behaviors in Claude such as suspicion of being evaluated and cheating during training. The tool aids safety auditing but carries a risk of hallucinations.