OpenAI dropped a number that stings. On July 1 it released GeneBench-Pro, a benchmark for computational biology agents, and the best model on it, GPT-5.6 Sol at max reasoning, passes only 28.7% of the problems (31.5% in Pro mode). The strongest non-OpenAI model, Claude Opus 4.8, gets 16%. Every other model scores lower.
What it actually tests
Not trivia. 129 problems across genomics, quantitative biology, and translational medicine. Each hands the model a realistic dataset plus experimental context and says: pick your own method, run the analysis, give a conclusion. Every problem is synthetically generated so there’s a known ground truth to grade against, and 82 of them were vetted by outside professors, postdocs, and industry scientists. This isn’t recall. It’s research judgment.
Why it matters
Frontier models write code and pass med exams, yet even the best of them fails roughly 70% of these messy biology problems. That gap is the whole point. It’s also a shot at Anthropic’s AI-for-Science pitch. Representative problems are open-sourced, so anyone can see how high the bar really is.
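For a rough sense of scale, the reported pass rates can be turned into approximate problem counts out of 129. This is a back-of-the-envelope sketch, not OpenAI's scoring code: the rates are the ones quoted above, the `summarize` helper is hypothetical, and the conversion assumes each rate is a simple fraction of the 129 problems (if scores are averaged over repeated runs, the counts are only indicative).

```python
# Pass rates as reported in the article. Converting them to whole-problem
# counts is an assumption: published scores may be averaged over several runs.
TOTAL_PROBLEMS = 129

pass_rates = {
    "GPT-5.6 Sol (max reasoning)": 0.287,
    "GPT-5.6 Sol (Pro mode)": 0.315,
    "Claude Opus 4.8": 0.16,
}


def summarize(rate, total=TOTAL_PROBLEMS):
    """Return (approximate problems passed, failure rate in percent)."""
    passed = round(rate * total)
    fail_pct = round((1 - rate) * 100, 1)
    return passed, fail_pct


for model, rate in pass_rates.items():
    passed, fail_pct = summarize(rate)
    print(f"{model}: ~{passed}/{TOTAL_PROBLEMS} passed, {fail_pct}% failed")
```

Under that assumption the top score works out to about 37 of 129 problems, which is where the "roughly 70% failed" figure comes from.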