When ARC-AGI-3 launched on March 25, 2026, it was supposed to settle the question: can AI reason like humans?
Instead, it accidentally revealed something more interesting — a completely different kind of AI architecture that doesn’t just beat the benchmark but also adapts when the benchmark changes underneath it.
What Happened
AIX Global Innovations, a company most people in AI had never heard of, ran its SeedIQ system against the ARC-AGI-3 developer toolkit starting March 13 — twelve days before the official launch.
Within two days, SeedIQ solved all three available games with a perfect 100% score. 20 out of 20 levels. 536 total actions. Top human-level performance.
That alone would have been remarkable. But then things got weird.
The Benchmark Changed — Quietly
On March 21, five days before launch, the complexity of the LS20 game increased dramatically. New mechanics appeared. Constraints that didn’t exist before suddenly mattered. “Pushers” that shoved your agent around. Sprites that darted around like the Golden Snitch in Harry Potter. Carryover penalties where early inefficiency punished later levels.
There was no announcement. No changelog entry. The public leaderboard continued showing old scores.
For deep learning agents, this was catastrophic. The ARC Prize Foundation’s own best test agent, Stochastic Goose, had previously scored 12.58% across two games with over 255,000 actions. After the changes? It dropped to 0.25%. Other agents that had been scoring 2–8% collapsed to effectively zero.
As of this writing, no agent has exceeded 0.50% on the official leaderboard.
But SeedIQ? It solved the harder version of LS20 again — this time in 433 actions (the human baseline is 546). Overall across all three games under the new difficulty: 95% score, 681 actions. Still in the top five human-level performances.
And it ran on a MacBook Pro M1. No GPU clusters. No token expenses.
How Is This Different?
SeedIQ is not a deep learning system. It’s not a reinforcement learning system. It’s not built on transformers.
It’s built on active inference — a framework derived from the Free Energy Principle, where systems maintain and update structured beliefs about their environment in real time. SeedIQ uses what AIX calls Adaptive Multiagent Autonomous Control (AMAC): multiple agents that learn, plan, and adapt locally while remaining coherent at the system level.
The key difference: deep learning systems depend on prior data and learned patterns. When the environment changes in ways they haven’t seen, performance doesn’t degrade gracefully — it collapses. Active inference systems instead construct world models on the fly, updating beliefs continuously as conditions shift.
That’s why SeedIQ adapted when the benchmark changed and other agents didn’t.
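The contrast is easy to see in a toy simulation. The sketch below is an illustration of the general belief-updating idea only — the agent names, the 0.9/0.1 likelihoods, and the small "rule may change" prior are all assumptions of this sketch, not anything from SeedIQ or AIX's proprietary AMAC. It pits a frozen pretrained policy against a minimal Bayes-filter agent in a world whose hidden rule changes mid-run, without warning:

```python
# Toy contrast: frozen pretrained policy vs. online belief updating.
# Illustrative only — not SeedIQ/AMAC, whose internals are proprietary.

N_ACTIONS, STEPS, SWITCH = 4, 400, 200
EPS = 0.02  # assumed prior probability that the hidden rule silently changes

class FrozenAgent:
    """Stand-in for a pretrained policy: always plays the action that
    worked during 'training' (rule 0) and never updates online."""
    def act(self):
        return 0
    def observe(self, action, success):
        pass  # no online learning

class BeliefAgent:
    """Maintains a belief distribution over which hidden rule holds,
    updates it with Bayes' rule after every observation, and acts on
    the currently most probable rule."""
    def __init__(self):
        self.belief = [1.0 / N_ACTIONS] * N_ACTIONS  # uniform prior
    def act(self):
        return max(range(N_ACTIONS), key=lambda a: self.belief[a])
    def observe(self, action, success):
        # Likelihood: success is strong evidence the tried action is the
        # rule; failure is strong evidence that it is not.
        post = [b * (0.9 if (a == action) == success else 0.1)
                for a, b in enumerate(self.belief)]
        z = sum(post)
        # Prediction step: leak a little probability to every rule, so the
        # agent never becomes certain that the world cannot change.
        self.belief = [(1 - EPS) * p / z + EPS / N_ACTIONS for p in post]

def run(agent):
    rule, score = 0, 0
    for t in range(STEPS):
        if t == SWITCH:
            rule = 2  # unannounced mid-run rule change, like LS20's update
        a = agent.act()
        success = (a == rule)
        agent.observe(a, success)
        score += success
    return score / STEPS

print(f"frozen policy:   {run(FrozenAgent()):.0%}")  # right only before the switch
print(f"belief updating: {run(BeliefAgent()):.0%}")  # recovers within a few steps
```

The frozen policy is correct only until the switch; the belief-updating agent takes a few wrong moves to notice the change, then recovers and stays correct. SeedIQ's actual machinery is far richer than this, but the structural point is the same: continuous online belief updating, not a bigger training set, is what buys robustness to silent changes.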
The Participation Problem
SeedIQ’s scores don’t appear on the official ARC-AGI-3 leaderboard. The reason highlights a structural tension in how AI benchmarks work.
ARC-AGI-3 requires participants to submit their full codebase, methodology, and implementation details — with terms allowing anyone to reuse and commercialize submitted work. For open-source deep learning projects, this is standard. For a proprietary system built on patent-pending technology in a trillion-dollar industry, it’s a non-starter.
AIX offered to participate under sealed evaluation — no prize money, just verification. ARC Prize declined.
This raises a question the AI community will need to grapple with: if the most capable proprietary systems can’t participate without giving away their IP, what exactly is the benchmark measuring?
ARC Prize founder François Chollet recently raised $40M+ for Ndea, his own AGI lab. A benchmark that requires full code disclosure while its founder builds a competing company raises, at minimum, reasonable questions about incentive alignment.
What This Means
Regardless of the benchmark politics, the results are striking:
- Deep learning agents scored near zero on interactive reasoning tasks when the environment shifted
- An active inference agent maintained top human-level performance under the same conditions
- The performance gap isn’t incremental — it’s a complete breakdown vs. sustained competence
This doesn’t mean SeedIQ is AGI. AIX is explicit about that. What it does suggest is that the path to adaptive, generalizable intelligence may not run through ever-larger language models. Active inference — building systems that maintain coherent world models and adapt in real time — represents a fundamentally different approach that doesn’t require the compute arms race.
For New Zealand, where AI development has focused heavily on LLM applications, this is worth paying attention to. A paradigm that runs on a laptop instead of a data center changes who gets to build AI and where.
SOURCES
- AIX Global Innovations
- Denise Holt — “ARC AGI 3: We Didn’t Expect This to Happen”
- ARC Prize Foundation