Google’s been quietly building something serious.
When they launched Gemini 3 with Deep Think mode back in November 2025, the initial numbers were respectable — 45.1% on ARC-AGI-2, the best any AI model had scored at the time. Solid, not earth-shattering. Then on February 12, 2026, Google dropped a major upgrade. The new Deep Think hit 84.6% on ARC-AGI-2, verified by the ARC Prize Foundation itself. That’s not a minor bump. Google nearly doubled its score on the one benchmark specifically designed to test whether AI can actually reason — not just regurgitate training data.
On March 26, Sundar Pichai announced that the upgraded Deep Think is rolling out to all Google AI Ultra subscribers, with API access opening up to select researchers and enterprises for the first time ever. Demis Hassabis called it a model that “hits new records on the most rigorous benchmarks in maths, science and reasoning.”
Big claims demand big evidence. Let’s see if the data holds up.
The Benchmark Scorecard
Start with ARC-AGI-2, the benchmark that matters most in this conversation. Francois Chollet — the creator of Keras — built ARC specifically to test genuine reasoning. Not pattern matching, not memorization, but the ability to figure out novel abstract puzzles from a handful of examples. Humans average about 60%. Most frontier AI models struggle to break 40%.
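To make the format concrete, here's a toy, ARC-style puzzle in Python. This is an illustrative sketch only — real ARC-AGI-2 tasks are far harder and less regular — but it shows the shape of the problem: grids of integers, a hidden transformation rule, a few input/output examples, and a test input to apply the induced rule to.

```python
# Toy ARC-style task: infer a grid transformation from a few examples,
# then apply it to a new input. Real ARC-AGI-2 tasks are far less
# regular; this only illustrates the "few-shot rule induction" format.

def mirror(grid):          # reflect each row left-to-right
    return [row[::-1] for row in grid]

def transpose(grid):       # swap rows and columns
    return [list(col) for col in zip(*grid)]

def flip_vertical(grid):   # reverse the order of the rows
    return grid[::-1]

HYPOTHESES = [mirror, transpose, flip_vertical]

def induce_rule(train_pairs):
    """Return the first candidate rule consistent with every example."""
    for rule in HYPOTHESES:
        if all(rule(inp) == out for inp, out in train_pairs):
            return rule
    return None

train = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0], [0, 5]], [[0, 5], [5, 0]]),
]
rule = induce_rule(train)
print(rule([[7, 8], [9, 0]]))  # -> [[8, 7], [0, 9]]
```

The toy version is solvable by brute force over three hand-written hypotheses. The real benchmark's difficulty is that the hypothesis space is effectively unbounded — which is exactly why pattern-matching models score so poorly on it.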
Deep Think’s 84.6% doesn’t just beat the human baseline. It crushes it by nearly 25 percentage points. To appreciate how absurd that is, look at where the other big names stand on the exact same test:
- Claude Opus 4.5 (with thinking, 64k context): 37.6%
- Gemini 3 Pro (the base model, without Deep Think): 31.1%
- GPT-5.1: 17.6%
- OpenAI o3 (Medium): roughly 3%
That last number deserves its own moment. OpenAI o3, one of the most hyped reasoning models of 2025, scored about 3% on ARC-AGI-2. The distance between 3% and 84.6% isn’t a gap. It’s an entirely different dimension.
But ARC-AGI-2 was just one line on the report card. The full scorecard reads like someone designed it to make other models feel bad:

- GPQA Diamond, which throws PhD-level science questions at models: 93.8%
- Humanity's Last Exam, a test literally created to stump AI: 48.4% without any tools, rising to 53.4% when Deep Think gets access to search and code execution
- Codeforces: 3455 Elo, "Legendary Grandmaster" territory, a rank held by a tiny fraction of human competitive programmers worldwide
- International Math Olympiad 2025: gold-medal level, solving 5 of 6 problems for 35 out of 42 points
- International Physics Olympiad and Chemistry Olympiad 2025: gold-medal level on the written sections of both
- CMT-Benchmark for theoretical physics: 50.5%
That Humanity’s Last Exam score is the one that should make you sit up. Hundreds of domain experts designed this benchmark to ask questions no AI should be able to answer. Scoring 48.4% without external tools means Deep Think is cracking problems that experts thought would resist AI for years. The people writing the tests are running out of runway faster than they anticipated.
What Deep Think Actually Does Differently
The name isn’t just marketing. Deep Think is an extended reasoning mode — it literally spends more time “thinking” before it responds. Similar in concept to OpenAI’s o1 and o3 models, but with key differences in execution. Where o3 routes reasoning through chain-of-thought prompting within a fixed architecture, Deep Think is built as an integrated mode within the Gemini 3 model family, leveraging the same multimodal backbone and massive context window.
What this means in practice: Deep Think doesn’t just try harder. It explores more solution paths, backtracks when it hits dead ends, and can hold much more context in its working memory while reasoning. The 2-million-token context window on Gemini 3 is a huge advantage here. If the model is reasoning through a complex problem that requires understanding a 200-page research paper, it can actually hold the entire document in context while thinking through the implications. OpenAI o3 maxes out at 200K tokens. DeepSeek R1 caps at 128K. Gemini’s 2M window is roughly 10x larger than the nearest competitor.
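That "explore, hit a dead end, back up, try again" loop is the same shape as classic backtracking search. The sketch below is an analogy, not a claim about how Deep Think is actually implemented — it just shows why exploring multiple solution paths costs so much more compute than a single forward pass:

```python
def subset_sum(nums, target):
    """Depth-first search with backtracking: commit to a choice,
    recurse, and undo the choice if the branch can't reach the target.
    Returns (solution_or_None, number_of_branches_explored)."""
    path, stats = [], {"branches": 0}

    def explore(i, remaining):
        stats["branches"] += 1
        if remaining == 0:
            return True                      # solution found
        if i == len(nums) or remaining < 0:
            return False                     # dead end: backtrack
        path.append(nums[i])                 # try including nums[i]
        if explore(i + 1, remaining - nums[i]):
            return True
        path.pop()                           # undo the choice
        return explore(i + 1, remaining)     # try excluding nums[i]

    found = explore(0, target)
    return (path if found else None), stats["branches"]

solution, branches = subset_sum([8, 6, 7, 5, 3], 15)
print(solution, branches)  # finds [8, 7] after exploring many branches
```

The branch count grows quickly with problem size, which is the toy-model version of why extended-reasoning queries burn far more tokens than ordinary completions.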
The tradeoff, of course, is speed and cost. Extended reasoning means more compute per query. The ARC Prize Foundation noted that Deep Think’s previous top score on ARC-AGI-2’s Semi-Private Test Set was 45% at $77.16 per problem. A team called Poetiq later matched and exceeded that at $30.57 per problem using a more efficient approach. So Deep Think is powerful but not always the most cost-efficient path to the answer. Those extra thinking tokens add up.
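Back-of-the-envelope, the per-problem figures above translate into serious money at evaluation scale. The task count below (120) is an assumption for illustration — the actual Semi-Private set size may differ — but the ratio between the two approaches comes straight from the quoted prices:

```python
# Rough cost comparison using the per-problem figures quoted above.
# TASKS = 120 is an assumed eval-set size, for illustration only.
TASKS = 120
deep_think_run = 77.16 * TASKS   # earlier Deep Think run, $/problem
poetiq_run     = 30.57 * TASKS   # Poetiq's more efficient approach

print(f"Deep Think: ${deep_think_run:,.2f}")
print(f"Poetiq:     ${poetiq_run:,.2f}")
print(f"Cost ratio: {77.16 / 30.57:.2f}x")  # ~2.52x more expensive
```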
Where Deep Think Sits in the Reasoning Race
Numbers on a leaderboard don’t exist in a vacuum. If you’re choosing a reasoning model today, the competitive landscape is messy and context-dependent.
Against OpenAI, the story depends entirely on what you’re doing. OpenAI o3 has strong reasoning in certain structured domains — math word problems, logical deduction chains — but its ARC-AGI-2 score is frankly embarrassing next to Deep Think. GPT-5.2 Pro briefly held the top ARC-AGI-2 spot with 54.2%, and Deep Think’s February upgrade made that look like a footnote overnight. The newer o4-mini retains about 85-90% of o3’s reasoning at one-fifth the cost, which is clever positioning for price-sensitive applications. But on abstract reasoning, neither o3 nor o4-mini is in the conversation.
Against Anthropic’s Claude, the comparison gets more nuanced. Claude Opus 4.5 trails Deep Think significantly on ARC-AGI-2, but developers frequently prefer Claude for the experience of actually working with it. Multiple Hacker News commenters flagged that Gemini still does irritating things like unwanted code refactors and adding comments you didn’t ask for. One former Googler called Gemini “consistently the most frustrating model I’ve used for development.” Scoring 84.6% on abstract puzzles and being pleasant to pair-program with are, apparently, two very different capabilities.
Against DeepSeek R1, it's about economics. DeepSeek offers respectable reasoning at dramatically lower cost — OpenAI o3 costs roughly 18 times as much as DeepSeek, and about 100 times as much as Gemini 3 Flash. You can also self-host DeepSeek, which matters when data needs to stay inside your firewall. On raw reasoning power, though, DeepSeek isn't close to Deep Think.
The community response has been split. One developer team said Deep Think was “the first model where we felt compelled to switch our default smart agent on model release day.” Others noted that scoring well on ARC-AGI-2 doesn’t necessarily mean the model is better at the messy, ambiguous tasks that make up real-world work. Both camps are probably right. Benchmarks test what benchmarks test — everything else is vibes.
For consumers, Google AI Ultra runs roughly $125 for three months, which gives you Deep Think access in the Gemini app. Not trivial, but not outrageous if you’re using it for scientific or engineering work where the extended reasoning actually makes a difference.
Who This Is Actually Built For
Google isn’t positioning Deep Think as a chatbot upgrade. The messaging is surgical: this is for scientists, researchers, and engineers who work on hard problems.
The practical capabilities Google highlights are specific. Deep Think can detect subtle logical errors in technical research papers — mistakes that human peer reviewers miss. It can model physical systems through code. It can reason through multi-step problems in chemistry and physics that would eat hours of a graduate student’s afternoon. The Olympiad-level results aren’t just benchmark bragging rights — they demonstrate that the model handles domain-specific quantitative reasoning across multiple scientific disciplines.
The API early access program is where things get genuinely interesting. Forget chat sessions — imagine integrating Deep Think into an automated research pipeline that evaluates experimental results, flags inconsistencies in data, or generates hypotheses from complex datasets. That’s the use case Google is chasing, and the 2-million-token context window makes it plausible in a way that wasn’t possible with smaller context models.
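That pipeline pattern can be sketched in a few lines. Everything model-facing here is hypothetical — `ask_model` is a stand-in for whatever the real Deep Think API call ends up looking like, and the prompt and checks are invented for illustration. The structural point is real, though: run cheap deterministic checks first, and spend the slow, expensive reasoning-model calls only on what survives them.

```python
# Sketch of the automated-research-pipeline pattern described above.
# `ask_model` is a stand-in for a real Deep Think API call; the model
# interaction, prompt format, and checks are hypothetical placeholders.

def ask_model(prompt: str) -> str:
    # A real pipeline would call the Gemini API with an
    # extended-reasoning model here; stubbed out for illustration.
    return "model-review: reported n disagrees with the raw data"

def review_experiment(results: dict) -> list[str]:
    """Run cheap deterministic checks first, then escalate anything
    suspicious to the (slow, expensive) reasoning model."""
    flags = []
    # Deterministic sanity checks: no model call needed.
    if results["n_reported"] != len(results["measurements"]):
        flags.append("reported n does not match measurement count")
    if any(m < 0 for m in results["measurements"]):
        flags.append("negative measurement value")
    # Only spend reasoning-model compute when something looks off.
    if flags:
        flags.append(ask_model(f"Explain these inconsistencies: {flags}"))
    return flags

report = review_experiment(
    {"n_reported": 5, "measurements": [0.9, 1.1, 1.0, 0.8]}
)
print(report)
```

The gating step matters because of the cost profile discussed earlier: at tens of dollars per hard query, you don't want the reasoning model triaging things a one-line `if` statement can catch.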
Google’s execution cadence is telling. Just one week after this Deep Think upgrade, they shipped Gemini 3.1 Pro, scoring 77.1% on ARC-AGI-2 — more than double what base Gemini 3 Pro managed. Rapid mid-cycle upgrades instead of waiting for full generation releases. If Deep Think gets the same treatment, the next iteration could push past 90% on ARC-AGI-2. Google is also building a broader ecosystem around Gemini, from coding tools to workspace integrations, which gives Deep Think’s reasoning capabilities more surface area to actually be useful.
Of course, by the time Deep Think cracks 90%, the goalposts will have moved. ARC-AGI-3 has already launched with an entirely new interactive format that makes its predecessor look like a warm-up. The benchmark treadmill never stops. But in late March 2026, Google holds the highest verified score on the most respected abstract reasoning test in AI. For once in the reasoning race, it’s not OpenAI setting the pace — it’s Google.