The gap between AI hype and AI capability has a new number: 0.66%.
That is the best score achieved by any AI model on ARC-AGI-3, a new interactive reasoning benchmark released in late March 2026. Humans average 80%. The top human score is 98.86%. The best AI model scores less than one percent.
This is not a drill. This is not a cherry-picked failure. This is a benchmark specifically designed to test the kind of fluid intelligence that AI companies claim to be approaching — and the results are devastating.
What ARC-AGI-3 Actually Tests
Previous ARC benchmarks tested pattern recognition on static grids. ARC-AGI-3 raises the bar significantly. The benchmark challenges AI agents to:
- Explore novel 2D grid environments they have never seen before
- Adapt strategies on the fly based on sparse feedback
- Transfer learned concepts to entirely new problem structures
- Demonstrate genuine reasoning rather than pattern matching
The key word is “interactive.” Agents do not just solve puzzles from images. They must explore, experiment, and learn from the environment in real time. This is closer to how humans actually reason — not by recognising patterns, but by interacting with problems and adapting.
The benchmark’s design intentionally filters for fluid intelligence: the ability to solve novel problems without prior training. This is precisely what current AI systems, built on massive pattern-matching over training data, are worst at.
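The interactive loop described above — explore, act, observe sparse feedback, adapt — is the standard agent-environment cycle. The sketch below is purely illustrative: none of these class or function names come from the actual ARC-AGI-3 API, and the toy grid world stands in for the benchmark's far richer environments. A random agent like this one is the "random guessing" floor the scores get compared against.

```python
# Hypothetical sketch of an interactive evaluation loop.
# All names (GridEnvironment, step, random_agent) are illustrative,
# NOT the real ARC-AGI-3 interface.

import random


class GridEnvironment:
    """Toy 2D grid world: reach a hidden goal cell with sparse feedback
    (reward arrives only on success, never along the way)."""

    def __init__(self, size=5, seed=0):
        rng = random.Random(seed)
        self.size = size
        self.goal = (rng.randrange(size), rng.randrange(size))
        self.pos = (0, 0)

    def step(self, action):
        """Apply a move; return (observation, reward, done)."""
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done  # sparse reward


def random_agent(env, max_steps=200, seed=1):
    """Baseline explorer: acts at random, learns nothing from feedback.
    Returns the number of steps to reach the goal, or None on failure."""
    rng = random.Random(seed)
    for t in range(max_steps):
        _, _, done = env.step(rng.choice(["up", "down", "left", "right"]))
        if done:
            return t + 1
    return None
```

Anything that beats this baseline must actually use the feedback — which is exactly the capability the benchmark is probing.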
The Score Gap Is Staggering
| Metric | AI (Best) | Human (Average) | Human (Top) |
|---|---|---|---|
| Score | 0.66% | ~80% | 98.86% |
The gap is not marginal. It is more than two orders of magnitude. The best AI model in existence — presumably among the most expensive and capable systems ever built — performs at a level that is, for practical purposes, indistinguishable from random guessing on this benchmark.
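The "two orders of magnitude" claim follows directly from the table's figures:

```python
# Scores from the leaderboard table above (percentages)
ai_best = 0.66
human_avg = 80.0

ratio = human_avg / ai_best
print(f"Average human vs best AI: {ratio:.0f}x")  # prints "Average human vs best AI: 121x"
```

A 121x gap is just over two orders of magnitude (10² = 100x).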
This matters because ARC-AGI-3 is testing something fundamental. If AI cannot explore a novel 2D grid environment and adapt to it with sparse feedback, the claim that AGI is imminent becomes very difficult to sustain.
Context: Benchmarks vs Reality
Singularity.Kiwi has covered the benchmark problem before. AI models regularly beat human baselines on narrow, well-defined tasks — chess, Go, medical diagnosis, bar exams, coding challenges. Each victory generates headlines about AI surpassing human intelligence.
But these benchmarks measure narrow competence, not general intelligence. A model that can pass the bar exam cannot navigate a novel environment it has never encountered. A system that generates fluent text cannot reason about a simple grid puzzle without training examples.
ARC-AGI-3 strips away the narrow-task advantage. It presents genuinely novel problems where memorised patterns and training data provide no benefit. Under those conditions, the most capable AI models collapse.
The $2M Question
The ARC Prize 2026 competition, with its $2 million prize pool, runs through November. The incentive to crack this benchmark is enormous. Someone may well find a technique that dramatically improves AI performance on ARC-AGI-3 before the competition closes.
But that is precisely the point. Current approaches — scaling up language models, adding more training data, increasing compute — have hit a wall on this benchmark. If a breakthrough comes, it will likely require a fundamentally different approach to AI architecture. The benchmark is not just measuring current capability. It is measuring whether the current paradigm can ever achieve general intelligence.
The sub-1% scores suggest it cannot, at least not with the transformer-based, next-token-prediction architecture that dominates AI today. ARC-AGI-3 may be the benchmark that forces the field to acknowledge what it has been avoiding: that bigger language models are not a path to AGI.
What This Means
For Singularity.Kiwi readers tracking the gap between AI promises and AI reality, ARC-AGI-3 is the most important benchmark result since the original ARC paper. It does not mean AI is useless — models remain extraordinarily capable at narrow tasks. It does mean that the “AGI is imminent” narrative needs a serious downgrade.
When the best AI model on Earth scores below 1% on a test that humans pass routinely, the distance to general intelligence is not being closed. It is being measured, and it is vast.
SOURCES
- ARC Prize Foundation — ARC-AGI-3 Leaderboard (March 2026)