AGI Countdown

Meta's Superintelligence Team Unleashes Muse Spark — Multi-Agent Reasoning That Thinks in Parallel

Nine months after gutting its AI stack, Meta's superintelligence team delivers Muse Spark — a model that runs parallel agents to think harder, not longer. The AGI race just got a new player.

Meta AI, Muse Spark, Superintelligence, AGI, Multi-Agent Systems

Nine months after Meta tore down its entire AI stack and started from scratch, the company’s newly formed Superintelligence Labs has delivered its first model. Muse Spark isn’t an incremental improvement over Llama. It’s a ground-up reset — and it’s betting on a fundamentally different way to scale intelligence.

The headline: instead of making one model think longer, Muse Spark runs multiple agents in parallel, each generating solutions that get refined and aggregated. Meta calls this “Contemplating mode.” It’s the clearest signal yet that the frontier of AI capability isn’t just about bigger models. It’s about smarter orchestration.


What Muse Spark Actually Does

Muse Spark is a natively multimodal reasoning model — meaning it was built from the ground up to process text and visual inputs simultaneously, rather than bolting a vision module onto a language model after the fact. It supports tool use, visual chain of thought, and multi-agent orchestration at inference time.

The architecture reflects three distinct scaling axes:

  • Pretraining: Meta rebuilt its entire pretraining stack with improvements to model architecture, optimization, and data curation. The result: the same capabilities as Llama 4 Maverick with less than a tenth of the compute.

  • Reinforcement Learning: Post-pretraining, RL training amplifies capabilities through outcome-based feedback. Meta reports smooth, predictable log-linear gains in pass@1 and pass@16 — meaning the model improves consistently as RL compute scales, without the instability that typically plagues large-scale RL.

  • Test-Time Reasoning: The model is trained to think before responding, with a twist. RL training maximizes correctness subject to a penalty on thinking time. This produces what Meta calls “thought compression” — the model initially improves by thinking longer, then compresses its reasoning to solve problems using fewer tokens, then extends again for stronger performance.
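Meta hasn't published the actual objective, but the "correctness subject to a penalty on thinking time" idea can be sketched as a simple reward function. The function name, penalty coefficient, and token counts below are illustrative assumptions, not Meta's training setup:

```python
def compressed_reward(correct: bool, thinking_tokens: int,
                      penalty: float = 0.001) -> float:
    """Illustrative RL reward: maximize correctness while paying a
    per-token cost for thinking. The penalty coefficient is a made-up
    assumption chosen only to make the trade-off visible."""
    return (1.0 if correct else 0.0) - penalty * thinking_tokens

# A correct short answer earns more than a correct long one (~0.8 vs ~0.1),
# so the model is pushed to compress its reasoning...
short_win = compressed_reward(True, thinking_tokens=200)
long_win = compressed_reward(True, thinking_tokens=900)
# ...but any correct answer still beats a wrong one, so it won't
# sacrifice accuracy just to think less.
wrong = compressed_reward(False, thinking_tokens=50)
assert short_win > long_win > wrong
```

Under a reward shaped like this, longer thinking only pays off when it actually flips an answer from wrong to right, which matches the compress-then-extend dynamic Meta describes.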


Contemplating Mode: The Real Innovation

This is the part that matters most. In standard test-time scaling, you make a single agent think for longer. That works, but latency scales linearly with compute.

Muse Spark’s Contemplating mode takes a different approach: multiple agents run in parallel, each producing solutions. Those solutions are iteratively refined and aggregated into a final output. The trade-off is compute for capability, not time for capability.
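Meta hasn't disclosed how Contemplating mode is implemented, but the generate-in-parallel, refine, then aggregate pattern it describes can be sketched roughly as follows. Every name here is hypothetical, and majority voting is just one simple choice for the unspecified aggregation step:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def contemplate(prompt, agents, refine_rounds=2):
    """Sketch of a parallel generate-refine-aggregate loop (an
    illustrative guess at the pattern, not Meta's implementation).

    `agents` is a list of callables (prompt, peer_solutions) -> solution.
    In the first round each agent sees no peer solutions; in later
    rounds it can refine its answer against what the others produced.
    """
    solutions = [None] * len(agents)
    for _ in range(refine_rounds):
        # All agents run in parallel, so wall-clock latency stays
        # roughly one round, not one round per agent.
        with ThreadPoolExecutor(max_workers=len(agents)) as pool:
            futures = [pool.submit(agent, prompt,
                                   [s for s in solutions if s is not None])
                       for agent in agents]
            solutions = [f.result() for f in futures]
    # Aggregate: majority vote over the final solutions (the real
    # aggregation step is not publicly specified).
    return Counter(solutions).most_common(1)[0][0]


# Toy usage: three stub "agents" that ignore their peers.
agents = [lambda p, peers: "42",
          lambda p, peers: "42",
          lambda p, peers: "7"]
print(contemplate("What is the answer?", agents))  # majority answer: "42"
```

The point of the sketch is the cost profile: adding agents multiplies compute but not latency, whereas making a single agent think longer multiplies both.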

The benchmarks largely back this up. On Humanity’s Last Exam With Tools — a benchmark testing expert-level multidisciplinary knowledge — Muse Spark Contemplating scores 58.4, well ahead of Gemini 3.1 Deep Think’s 53.4 and just shy of GPT-5.4 Pro’s 58.7. On FrontierScience Research, Muse Spark reaches 38.3, ahead of GPT-5.4 Pro’s 36.7 and Gemini 3.1 Deep Think’s 23.3.

These aren’t marginal gains. Contemplating mode is achieving frontier-level performance with a fundamentally different compute paradigm.


Where Muse Spark Wins — And Where It Doesn’t

The model’s strongest results come in health reasoning. On HealthBench Hard — 1,000 open-ended health queries — Muse Spark scores 42.8, compared to Claude Opus 4.6 Max’s 14.8, Gemini 3.1 Pro High’s 20.6, and GPT-5.4 Xhigh’s 40.1. Meta collaborated with over 1,000 physicians to curate training data for this capability.

Muse Spark also shows strength on visual tasks. On ScreenSpot Pro — which tests localizing UI elements in screenshots — it scores 72.2 (84.1 with Python tools), compared to Claude Opus 4.6 Max’s 57.7 (83.1) and GPT-5.4 Xhigh’s 39.0 (85.4).

But the gaps are revealing. On coding (SWE-Bench Verified), Muse Spark scores 77.4 — behind Claude Opus 4.6 Max at 80.8 and Gemini 3.1 Pro High at 80.6. On GPQA Diamond (PhD-level reasoning), it scores 89.5, trailing both competitors.

The sharpest weakness: abstract reasoning. On ARC AGI 2, Muse Spark scores 42.5 — significantly behind Gemini 3.1 Pro High at 76.5 and GPT-5.4 Xhigh at 76.1. This is the model’s clearest vulnerability.


Why This Signals a Shift in the AGI Race

Muse Spark matters not because it’s the best model on every benchmark — it clearly isn’t. It matters because of the architectural direction it represents.

Meta is explicitly betting that intelligence scaling comes not just from bigger models, but from smarter orchestration of multiple agents working in parallel. This is the same insight driving the broader agentic AI movement, but applied to the model’s own reasoning process.

The 10x compute efficiency claim is also significant. If accurate, it means Meta can train competitive models at a fraction of the cost of its previous generation — and a fraction of what competitors are spending. In a race where training runs cost hundreds of millions, efficiency advantages compound fast.

But the ARC AGI 2 gap is a real problem. Abstract reasoning — the ability to generalize patterns from minimal examples — is arguably the core capability needed for AGI. If Muse Spark is scoring 42.5 while competitors sit above 76, Meta still has a fundamental capability gap to close.


The Bigger Picture

Muse Spark is Meta’s declaration that it’s back in the frontier model race — and it’s doing it on different terms. Chief AI Officer Alexandr Wang’s team chose parallel agent orchestration over brute-force scaling, multi-round refinement over single-shot generation, and health-domain depth over generalist breadth.

The result is a model that’s genuinely innovative in some dimensions and genuinely behind in others. It leads in health. It trails in abstract reasoning. It’s efficient. It’s not dominant.

What makes this worth watching is the trajectory. Muse Spark is explicitly described as “the first in the Muse family.” Meta is investing across the entire stack — including the Hyperion data center built specifically to support further scaling. The 10x efficiency gain means they have room to grow without hitting the same compute walls that constrain competitors.

The AGI race is no longer just about who builds the biggest model. It’s about who orchestrates intelligence most effectively. Muse Spark just made that contest more interesting.


SOURCES

  • Meta AI Blog — Introducing Muse Spark: Scaling Towards Personal Superintelligence
  • MarkTechPost — Meta Superintelligence Lab Releases Muse Spark
  • The Verge — Meta is reentering the AI race with a new model called Muse Spark