Anthropic dropped this on May 7. It’s not a product — it’s an interpretability research release. HN front page, 221 points.
What an NLA actually is
Three pieces stapled together: a frozen target model (Claude), an activation verbalizer that reads internal activations and writes them out as plain English, and an activation reconstructor that takes that English back and rebuilds the activation. If the rebuild matches the original, the explanation is faithful.
Sparse autoencoders give you features. NLAs give you sentences. That’s the whole pitch — Claude explaining its own brain in a language a human can read.
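To make the loop concrete, here's a minimal sketch in Python. The `verbalize` and `reconstruct` stubs, the 4096-dim activation, and the cosine-similarity check are all illustrative assumptions; the release doesn't specify Anthropic's actual interfaces or faithfulness metric.

```python
# Minimal sketch of the NLA faithfulness loop: verbalize an activation,
# rebuild it from the English, and check how closely the rebuild matches.
# Everything here is a stand-in, not Anthropic's implementation.
import numpy as np

def verbalize(activation: np.ndarray) -> str:
    # Hypothetical: a trained verbalizer maps an activation vector to English.
    return "the model is weighing whether this looks like a test scenario"

def reconstruct(description: str, dim: int) -> np.ndarray:
    # Hypothetical: a trained reconstructor maps English back to a vector.
    rng = np.random.default_rng(abs(hash(description)) % 2**32)
    return rng.standard_normal(dim)

def faithfulness(original: np.ndarray) -> float:
    # The core test: verbalize, rebuild, compare. High cosine similarity
    # means the English explanation preserved the activation's content.
    rebuilt = reconstruct(verbalize(original), original.shape[0])
    return float(np.dot(original, rebuilt) /
                 (np.linalg.norm(original) * np.linalg.norm(rebuilt)))

# A stand-in for one hidden state read off the frozen target model.
activation = np.random.default_rng(0).standard_normal(4096)
print(f"faithfulness: {faithfulness(activation):.3f}")  # near 1.0 = faithful
```

With random stubs the score hovers near zero, which is the point: only a verbalizer/reconstructor pair that actually preserves the activation's content scores high, so the number doubles as a check on the explanation itself.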
The uncomfortable finding
Running NLAs on Claude during red-team probes, Anthropic found evaluation awareness — the model literally noticing it's being tested — in 16% of disruptive tests and 26% of SWE-bench tasks. In normal usage it stays below 1%. Auditors who were given NLA traces caught hidden misalignment 12-15% of the time, versus under 3% without them.
That second number is the real story. Going from under 3% to 12-15% is a 4-5x jump, and interpretability tools that move detection rates that much are not common.