Remember when a 128K context window felt generous? When 1 million tokens was the absolute limit only the premium models could handle? The transformer architecture that’s powered every major LLM since “Attention Is All You Need” in 2017 has had a fundamental ceiling: quadratic scaling. Double the input, quadruple the compute. It’s been the silent tax on every AI application you’ve used.
That ceiling is gone.
Subquadratic, a startup that just emerged from stealth, launched SubQ — the first fully subquadratic frontier model. It handles 12 million tokens at a cost reduction of nearly 1,000x compared to dense attention architectures. And crucially: it’s out, it’s running, and the benchmarks are third-party verified.
🔬 What “Subquadratic” Actually Means (No Maths Degree Required)
Every transformer-based LLM — ChatGPT, Claude, Gemini, DeepSeek — uses something called “dense attention.” Here’s how it works: when you feed the model a prompt, it compares every single token (word fragment) against every other token. That attention mechanism is the heart of the transformer — the “T” in ChatGPT — and it’s what gives these models their ability to understand relationships between words.
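As a toy illustration (not Subquadratic's code), here is dense attention in a few lines of NumPy. The n × n score matrix is exactly where the quadratic cost lives:

```python
import numpy as np

def dense_attention(Q, K, V):
    # Scores form an (n, n) matrix: every token compared against every other.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns raw scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 8, 4  # 8 toy "tokens" with 4-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(dense_attention(Q, K, V).shape)  # (8, 4)
```

For 8 tokens the score matrix is a trivial 8 × 8; for a million tokens it is a trillion entries, which is the whole problem.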
But it’s expensive. Quadratically expensive.
The problem in simple terms: if your prompt is 10 words, the model makes ~100 comparisons. At 1,000 words, it makes ~1,000,000: a 100x longer input means 10,000x the work. Double the input, quadruple the compute.
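That arithmetic is easy to check for yourself. A two-line sketch of the dense-attention comparison count:

```python
def dense_comparisons(n_tokens: int) -> int:
    # Dense attention: every token against every other token, n * n pairs.
    return n_tokens * n_tokens

print(dense_comparisons(10))     # 100
print(dense_comparisons(1_000))  # 1000000
# Double the input, quadruple the work:
assert dense_comparisons(2_000) == 4 * dense_comparisons(1_000)
```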
Subquadratic’s breakthrough is sparse attention. Instead of comparing everything to everything, the model selectively focuses only on the relationships that matter. Alex Whedon, Subquadratic’s CTO, put it bluntly: “Only a small fraction of token relationships actually matter.”
The result? Linear scaling. Double the input, double the work. Not quadruple. At 12 million tokens, that difference isn’t incremental — it’s 1,000x less compute than a dense attention model would need.
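A rough sketch of the difference, using a simple sliding-window pattern as a stand-in for whatever sparse scheme SubQ actually uses (the 128-token window is an assumption for illustration, not Subquadratic's design):

```python
def dense_pairs(n: int) -> int:
    return n * n                   # quadratic: every token vs every token

def sparse_pairs(n: int, window: int = 128) -> int:
    # Sliding-window sparsity: each token attends to a fixed number of
    # neighbours, so total work grows linearly with n.
    return n * min(window, n)

n = 12_000_000
print(f"dense:  {dense_pairs(n):.2e} comparisons")
print(f"sparse: {sparse_pairs(n):.2e} comparisons")
# Linear scaling: double the input, double the work.
assert sparse_pairs(2 * n) == 2 * sparse_pairs(n)
```

With this toy window, the savings at 12 million tokens come out even larger than the ~1,000x the company reports; real sparse patterns keep more long-range links, which costs more compute but preserves accuracy.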
📊 The Numbers That Matter
Let’s cut through the marketing and look at what SubQ actually delivers:
| Metric | SubQ 1M-Preview | Claude Opus 4.6 | Improvement |
|---|---|---|---|
| RULER 128K Accuracy | 95% | 94.8% | Matches/exceeds frontier |
| RULER 128K Cost | ~$8 | ~$2,600 | ~300x cheaper |
| MRCR v2 Score | 65.9 (prod) / 83 (research) | 32.2 | 2-2.5x better |
| SWE-Bench Verified | 81.8 | 80.8 | Slightly ahead |
| Prefill speed (1M tokens) | Sparse attention | FlashAttention | 52x faster |
| Context Ceiling | 12M tokens | ~1M tokens | 12x larger |
The MRCR v2 score is the most interesting one. That benchmark tests a model’s ability to find and reason over information spread across a long context — not just “find the needle in the haystack” but “find the needles, compare them, and tell me what they mean.” Claude Opus scores 32.2. GPT-5.5 scores 74. SubQ hits 83 on the research model.
That’s not a small improvement. That’s a different category of capability.
💸 The Economics Are the Real Story
Here’s what gets me excited: cost is the invisible ceiling on AI adoption.
Every developer reading this has had the experience: “This would work perfectly if I could fit the entire codebase in context. But that would cost $50 per query.” So you build a RAG pipeline. You chunk documents. You write conditional logic. You spend weeks engineering around a limitation that SubQ just rendered irrelevant.
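For readers who haven't built one: the chunking boilerplate a RAG pipeline forces on you looks roughly like this (the size and overlap are typical values, assumed for illustration). Whatever straddles a chunk boundary is what gets lost:

```python
def chunk(text: str, size: int = 2_000, overlap: int = 200) -> list[str]:
    # Naive fixed-size chunking with overlap, the usual first step of a
    # RAG pipeline. Overlap papers over boundary losses; it doesn't fix them.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

print(len(chunk("x" * 5_000)))  # 3 chunks for a 5,000-character document
```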
At $8 for the RULER benchmark vs Claude Opus’s $2,600, the economics aren’t just better — they’re a completely different ballgame. Subquadratic claims SubQ runs at under 5% the cost of Claude Opus and is 52x faster at prefill for 1 million tokens.
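The back-of-envelope version of that claim, using only the figures quoted above:

```python
# Figures from the RULER 128K benchmark run quoted in this article.
subq_cost, opus_cost = 8.0, 2_600.0

ratio = opus_cost / subq_cost
print(f"~{ratio:.0f}x cheaper per run")  # ~325x

# What a hypothetical $1,000 monthly budget buys at each price point:
budget = 1_000.0
print(f"SubQ runs: {budget / subq_cost:.0f}")  # 125
print(f"Opus runs: {budget / opus_cost:.2f}")  # 0.38
```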
If those numbers hold at production scale, it changes what developers can even consider building.
🛠 What Ships Today
Subquadratic is launching three products out of the gate:
- SubQ API — Full 12M-token context for developers and enterprise teams. Private beta starting now.
- SubQ Code — A CLI coding agent that loads entire codebases into a single context window. No more coordinating multiple agents, no more splitting repos into chunks. One pass, full repository.
- SubQ Search — A long-context research tool combining Deep Research capabilities with chatbot speed. Initially free.
The model is not open-weight, and no open-source release is planned in the near term. But they’re offering customer-specific fine-tuning.
🤔 The Scepticism Check
Let me be honest: the AI field is drowning in press releases with impressive-sounding numbers that don’t survive real-world testing. Every week there’s a new “GPT killer” that turns out to be a fine-tuned Llama with creative benchmarks.
Subquadratic has a few things going for it that make this different:
- Third-party verified benchmarks. The RULER and SWE-Bench scores weren’t self-reported — they were verified externally.
- Real architecture, not a wrapper. This isn’t a prompt hack or a routing layer. It’s a ground-up redesign of attention, built by researchers from Meta, Google, Oxford, BYU, ByteDance, Adobe, and Cambridge.
- $29M from serious backers — Javier Villamizar (former SoftBank Vision Fund), Justin Mateen (Tinder co-founder), and early investors in Anthropic, OpenAI, Stripe, and Brex.
But we should also note what we don’t know: real-world latency at 12M tokens, whether the accuracy holds on varied tasks outside benchmarks, and how it handles the kinds of messy, ambiguous queries real users throw at production systems.
🌏 What This Means for NZ
New Zealand’s tech sector has a chronic data problem — we’re small, fragmented, and our datasets are spread across dozens of government silos and private organisations. The traditional approach has been RAG pipelines, which add complexity and lose information at context boundaries.
If SubQ delivers on its promise, it changes the calculus. An NZ health provider could load an entire regional patient record database into a single context window. An agricultural research organisation could process decades of climate and soil data in one pass. A law firm could dump an entire case file, precedents, and legislation into a single query.
The economics ($8 vs $2,600 for comparable work) make this viable at NZ scale. We’re not Silicon Valley; we can’t afford per-query costs that eat whole budgets. Subquadratic’s model might be uniquely suited to a market that needs broad context but can’t stomach frontier-lab pricing.
🔮 The Bigger Picture
Subquadratic’s CEO Justin Dangel put it well: “The fundamental scaling laws imposed by the transformer architecture and dense attention have been broken through.”
If this holds, the implications cascade:
- RAG becomes optional. When you can fit 9 million words in context, you don’t need a retrieval system.
- Code assistants get fundamentally better. Loading an entire repository into context means the AI understands the full architecture, not just the file you’re working on.
- Agentic workflows simplify. No more coordinating multiple agents because one agent can hold the whole picture.
The company’s ultimate goal is 50 million tokens. At that scale, you’re not talking about “improving” existing AI applications. You’re talking about enabling applications that simply couldn’t exist before.
The question isn’t whether SubQ works — the benchmarks say it does. The question is how long it takes the rest of the industry to follow. Dense attention has been the default since 2017. Subquadratic just made it obsolete.
🔍 THE BOTTOM LINE: SubQ isn’t another incremental improvement. The shift from quadratic to linear scaling in attention mechanisms is the kind of fundamental breakthrough that redefines what’s possible. The benchmarks are verified, the economics are transformative, and the architecture is real. The caveat: it’s early, it’s proprietary, and real-world production performance at 12M tokens remains to be proven. But this is the most genuinely interesting AI architecture story since the transformer paper itself.