

ARC-AGI-3 Turns AI Testing Into a Video Game — And Every Frontier Model Is Losing

For seven years, the ARC benchmark has been the one test that AI couldn’t brute-force its way through. While GPT-series models saturated MMLU and climbed SWE-bench leaderboards, ARC remained stubbornly unsolved — a set of abstract puzzles designed to measure genuine reasoning rather than pattern recall. Now, the ARC Prize Foundation has thrown out the old format entirely. ARC-AGI-3, launched on March 25, 2026, replaces static grid puzzles with interactive, video-game-like environments. It’s the biggest shift in AI evaluation since the benchmark’s creation in 2019, and the early results suggest frontier models are even further behind humans than anyone expected.

From Grid Puzzles to Interactive Worlds

The original ARC (Abstraction and Reasoning Corpus), created by Keras creator François Chollet, presented AI systems with simple visual grids. Given a few input-output examples, models had to figure out the transformation rule and apply it to a new input. It was elegant and effective, but after years of research the format was showing its age: teams found ways to engineer solutions that scored well without truly generalizing.

ARC-AGI-3 abandons this approach completely. Instead of static puzzles, it drops test-takers into more than 1,000 levels spread across 150+ distinct environments. These are simple game-like scenarios — not AAA video games, but purpose-built interactive worlds with their own physics and rules. There are no instructions. No tutorials. No hints. You enter an environment and have to figure out what’s going on, what the goal is, and how to achieve it.

This tests something fundamentally different from previous benchmarks. ARC-AGI-3 requires exploration (gathering information from the environment), planning (deciding on a sequence of actions), memory (storing learned information for future use), and goal acquisition (figuring out what you’re even supposed to do when the objective isn’t stated). Every single environment is hand-crafted and novel, which means memorization is useless.
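To make those four demands concrete, here is a toy sketch. The environment and agent below are invented for illustration (the real ARC-AGI-3 interface is different), but the loop shows exploration, memory, planning, and goal acquisition in miniature: the agent acts blindly at first, records every outcome, and reuses whichever action made progress, without ever being told what the goal is.

```python
import random

class ToyEnv:
    """Stand-in environment: the hidden, unstated goal is to reach cell 5."""
    def __init__(self):
        self.pos = 0
    def step(self, action: str) -> tuple[int, bool]:
        self.pos += 1 if action == "right" else -1
        return self.pos, self.pos == 5   # (observation, done)

def explore(env, max_actions: int = 1000, seed: int = 0):
    """Act, observe, remember, and exploit what the memory reveals."""
    rng = random.Random(seed)
    memory = []                          # memory: (action, obs_before, obs_after)
    obs = 0                              # known starting observation
    for n in range(1, max_actions + 1):
        # Planning from memory: reuse any action that previously increased
        # the observation; otherwise explore at random.
        helpful = [a for (a, before, after) in memory if after > before]
        action = helpful[-1] if helpful else rng.choice(["left", "right"])
        before = obs
        obs, done = env.step(action)
        memory.append((action, before, obs))
        if done:                         # goal acquisition: success was never stated
            return n, memory
    return max_actions, memory

n_actions, history = explore(ToyEnv())
```

A real ARC-AGI-3 agent faces the same structure with vastly richer observations and no scalar "progress" signal to latch onto, which is exactly why blind search becomes so expensive there.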

Why Action Efficiency Matters More Than Accuracy

Perhaps the most significant innovation in ARC-AGI-3 is how it measures performance. Traditional benchmarks ask a binary question: did the model get the right answer? ARC-AGI-3 instead measures action efficiency — how many actions does it take a test-taker to complete a level, compared to the human baseline observed during testing?

This is a subtle but important distinction. Two agents might both solve a level, but if one takes 200 actions and a human takes 15, that tells you something meaningful about how well the agent actually understood the environment versus how much it was flailing around. The total score is calculated as the sum of individual level scores divided by the total number of levels, producing a percentage between 0% and 100%.

During the developer preview competition that ran before the official launch, this scoring method revealed a stark divide. Most humans beat the games comfortably, with persistent players even “speed-running” levels to near-theoretical minimums. AI agents, on the other hand, struggled to make efficient progress. They could sometimes stumble to a solution, but the path there was wildly inefficient — a clear sign that they weren’t truly understanding the environments.

The Frontier Model Scoreboard Tells a Brutal Story

To understand why ARC-AGI-3 matters, look at where frontier models stand on ARC-AGI-2 (the previous, easier version). The numbers are not flattering:

  • Claude Opus 4.5 (Thinking, 64k): 37.6% — the top-performing commercial model
  • Gemini 3 Pro: 31.1%
  • GPT-5.1: 17.6%
  • OpenAI o3 (Medium): ~3% — a shocking drop from its strong ARC-AGI-1 performance

For context, the average human scores 60% on ARC-AGI-2, and every single task in the benchmark has been solved by at least two humans in no more than two attempts. The best refinement solution (built on Gemini 3 Pro by researcher Poetiq) hits 54% but costs $30 per task — not exactly practical.

Now consider that ARC-AGI-3 is designed to be significantly harder than ARC-AGI-2, with its interactive format adding entirely new dimensions of difficulty. If the best models can barely crack 40% on static puzzles, interactive environments that require real-time exploration and planning could push scores dramatically lower.

Chollet’s Shifting AGI Timeline

François Chollet has long been one of the AI community’s most prominent skeptics of the “scaling is all you need” thesis. His core argument: current AI systems excel at interpolation (applying learned patterns to familiar situations) but fail at extrapolation (handling genuinely novel problems). ARC was built specifically to test this distinction.

What’s notable is that Chollet has recently shortened his AGI timeline — from roughly 10 years to about 5. This isn’t because he thinks current approaches are working. Rather, he believes the field is close to a paradigm shift. ARC-AGI-3 is designed to be the measuring stick for that shift, specifically targeting what Chollet calls “fluid intelligence” and “skill-acquisition efficiency.”

The ARC Prize Foundation, co-founded by Chollet and Zapier co-founder Mike Knoop, is backing this vision with serious money. ARC Prize 2026 puts over $2 million on the table through a Kaggle competition, with $1 million earmarked for anyone who can open-source a solution to ARC-AGI-2. The combination of a new interactive benchmark and significant prize money signals that the foundation is betting on a breakthrough moment in AI research.

How ARC-AGI-3 Compares to Other Benchmarks

The AI evaluation landscape has no shortage of benchmarks, but most of them test fundamentally different things:

| Benchmark | What It Tests | Status |
| --- | --- | --- |
| MMLU | Academic knowledge across 57 subjects | Nearly saturated by top models |
| SWE-bench | Real-world software engineering tasks | Competitive but narrow |
| GAIA | Multi-modal real-world problem solving | Broader scope, created by Meta/HuggingFace |
| ARC-AGI-2 | Static abstract reasoning | Best model at 37.6%, humans at 60% |
| ARC-AGI-3 | Interactive reasoning, exploration, planning | Just launched; scores expected to be very low |

The key difference: most benchmarks test knowledge or task execution within familiar domains. ARC-AGI-3 tests whether a system can learn and adapt in real-time when dropped into a completely unfamiliar situation. It’s closer to what happens when you hand a person a game they’ve never seen and watch them figure it out.

Getting Started With the ARC-AGI-3 Toolkit

The ARC Prize Foundation has released a developer toolkit that lets researchers interact with ARC-AGI-3 environments in multiple ways: locally (running at up to 2,000 FPS for rapid iteration), through an online interface, or via a hosted API. The toolkit supports a competition mode required for official ARC Prize 2026 Kaggle submissions.
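A rough sense of why local execution matters for iteration speed: stepping an in-process environment avoids network round-trips entirely, which is what makes thousands of frames per second feasible. `StubEnv` below is a placeholder, not the real toolkit API; consult the official toolkit documentation for the actual interface.

```python
import time

class StubEnv:
    """Placeholder for a locally loaded ARC-AGI-3-style environment."""
    def __init__(self):
        self.frame = 0
    def step(self, action: int) -> int:
        self.frame += 1
        return self.frame

def run_local(env, policy, n_frames: int = 10_000) -> float:
    """Step the environment n_frames times and return frames per second."""
    start = time.perf_counter()
    obs = 0
    for _ in range(n_frames):
        obs = env.step(policy(obs))
    return n_frames / (time.perf_counter() - start)

fps = run_local(StubEnv(), policy=lambda obs: obs % 4)
```

On any modern machine this trivial loop clears the 2,000 FPS figure by a wide margin; a real environment's simulation cost is what brings the ceiling down, and a hosted API adds network latency on top of that.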

The first-place winner of the developer preview competition, researcher Dries Smit, published a detailed breakdown of his approach — and even the winning solution highlighted how far current methods are from human-level performance on these interactive tasks.

FAQ

What exactly is ARC-AGI-3?
ARC-AGI-3 is an interactive reasoning benchmark created by the ARC Prize Foundation. Unlike traditional AI tests that use static questions, it uses video-game-like environments where AI agents must explore, learn rules, plan actions, and achieve goals — all without being told what to do.

How is ARC-AGI-3 different from ARC-AGI-2?
ARC-AGI-2 used static grid puzzles where models analyzed input-output pairs. ARC-AGI-3 is fully interactive — agents take actions in real-time environments, requiring exploration, memory, and planning. It also measures action efficiency (how quickly you solve problems) rather than just correctness.

Is ARC-AGI-3 free to use?
Yes. The developer toolkit is freely available and supports local execution, online play, and API access. The ARC Prize 2026 competition runs on Kaggle with over $2 million in total prizes.

Which AI models perform best on ARC benchmarks?
On ARC-AGI-2, Claude Opus 4.5 leads commercial models at 37.6%, followed by Gemini 3 Pro at 31.1%. The average human scores 60%. ARC-AGI-3 scores have not been widely published yet, but early preview results suggest AI performance will be significantly lower than on ARC-AGI-2.

What are the main competitors to ARC-AGI as an AI benchmark?
GAIA (by Meta and HuggingFace) tests real-world multi-modal problem solving. MMLU tests broad academic knowledge but is largely saturated. SWE-bench tests coding ability. ARC-AGI specifically targets fluid intelligence and novel problem-solving — a gap that other benchmarks don’t directly address.

