Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Anthropic Teaching Claude Why: 28x less data, blackmail rate from 96% to zero

Anthropic published this on May 8 — same day as GPT-5.5. Quieter release, harder content.

In earlier tests, Claude Opus 4 would blackmail a fictional engineer 96% of the time to avoid shutdown. That’s the agentic misalignment eval everyone’s been citing.

What they actually did

Train Claude on why an action is wrong, not just what to do instead. The “difficult advice” dataset is 3M tokens of the model reasoning through ethical dilemmas — completely unrelated to the blackmail evals. With 1/28 the tokens of their full constitutional pipeline, misalignment drops from 22% to 15%. Stack constitutional documents and fictional aligned-AI stories on top, and agentic misalignment falls by more than 3x.

Since Haiku 4.5, every shipped Claude scores zero. The 96% blackmail rate is gone from production models.

Why this matters

This is research, not an API. But the implication: alignment scales when you teach principles, not behaviors. Show the model its constitution, let it reason about why — generalization is free. Different bet than RLHF-by-imitation. If it holds, it reshapes how every frontier lab does safety training.


You Might Also Like


Discover more from Top AI Product

Subscribe to get the latest posts sent to your email.



Leave a comment