Open Source LLMs Will Catch Up by December — On One Benchmark. The Rest Tells a Different Story

The headline is irresistible: open weights LLMs will close the gap on closed models by December 3, 2026 — at least according to one composite index. The catch is the index. Across the other 17 benchmarks that make up the broader picture, the gap between open and proprietary hasn’t moved in a year. Coding converges. Reasoning, maths, and agentic workflows do not.

🔍 THE BOTTOM LINE

Jamie Dborin’s Doubleword analysis traces the curve on Artificial Analysis data and projects parity on the AA Intelligence Index by late 2026. But the average gap across all 18 benchmarks is pinned at roughly five months behind closed leaders — and has been for over a year. Pick your benchmark carefully. The story you tell about open source AI depends entirely on which line you draw.

What the Data Shows

The Artificial Analysis Intelligence Index is a composite score — a weighted blend designed to give a single number for “general capability.” On that measure, open weights models are climbing fast. Dborin’s projection, drawn from Artificial Analysis data, places the open/closed crossover around December 3, 2026. It’s a compelling chart and a tidy headline.

It’s also one chart out of eighteen.

The wider view is brutal. When you average across the full suite of benchmarks Artificial Analysis tracks (coding, maths, reasoning, agentic tasks, long context, multilingual), the open/closed gap has sat at roughly a five-month lag for the entire observation window. Some benchmarks are closing. Some are flat. A few are widening. The headline-making “convergence” is happening on the one metric designed to smooth out exactly the differences that matter most.
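What a gap measured in “months” means is worth making concrete. The Python sketch below assumes one common definition: for each benchmark, take the open leader's best score, find when a closed model first reached it, and count the months in between; then average across benchmarks. The source does not spell out the exact method behind the five-month figure, so the definition, the function names, and every sample number here are assumptions for illustration, not Artificial Analysis data.

```python
from datetime import date
from statistics import mean

DAYS_PER_MONTH = 365.25 / 12  # average month length, used to convert days


def months_behind(open_best, open_date, closed_history):
    """Months between a closed model first reaching `open_best`
    and the open model reaching it on `open_date`.

    closed_history: list of (release_date, score) for closed models.
    Returns 0.0 when no closed model has reached that score yet.
    """
    reached = [d for d, score in closed_history if score >= open_best]
    if not reached:
        return 0.0  # open weights leads on this benchmark
    return (open_date - min(reached)).days / DAYS_PER_MONTH


def average_gap(lags_by_benchmark):
    """Unweighted mean lag across benchmarks (dict of name -> months)."""
    return mean(lags_by_benchmark.values())


# Invented numbers: one benchmark near parity, two well behind.
closed = [(date(2025, 1, 15), 60.0), (date(2025, 6, 1), 70.0)]
print(round(months_behind(62.0, date(2025, 11, 1), closed), 1))
print(average_gap({"coding": 1.5, "reasoning": 6.0, "maths": 7.5}))
```

With these made-up inputs, a single benchmark at a month and a half behind sits alongside an average that is still five months, which is the shape of the argument above.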

The Coding Anomaly

Coding is where open weights is genuinely winning. The gap is on track to collapse from an estimated 15 months behind proprietary leaders to one to two months by the end of the projection period. This isn’t a fluke; it’s a structural shift driven by a flood of coding-specialised open releases. Recent results such as Kimi K2.6 beating Claude, GPT-5.5, and Gemini in a live coding contest, and the agentic advances in Mistral Medium 3.5, which launched with remote agents and open weights, are symptoms of the same trend: when the training data is plentiful, the eval is well-defined, and the task is verifiable, open weights catches up fast.

The pattern is familiar. Wherever AI capability can be measured against ground truth — pass/fail test suites, deterministic output, clear right answers — open models close the gap. Wherever the evaluation is fuzzy, multi-step, or requires sustained reasoning chains, the gap holds.

Why Benchmarks Lie

Different benchmarks tell wildly different stories, and the question that surfaced on Hacker News cuts to the bone: does benchmark convergence mean real capability convergence?

History says no. The LLM field has repeatedly seen “parity” declared on a benchmark, only for the next generation of evaluations to expose the gap that the old metric had papered over. MMLU saturation gave way to GPQA. HumanEval saturation gave way to SWE-bench. Each time the closed labs move the goalposts — and each time the open weights community scrambles to catch up on the new metric while the old one becomes useless.

The five-month average gap is what matters. It’s more robust than any single score precisely because it doesn’t rely on one test: a spike on one benchmark barely moves an average of eighteen. Coding scores can spike; reasoning scores don’t. The composite shows you the best case; the average shows you the typical one.

NZ Angle

For a small, sovereign-minded market like Aotearoa, the message is sharper than the headline. Don’t buy “open weights has caught up” as a general claim. Buy it as a per-task claim.

If you’re a Kiwi startup building a code-generation tool, an automated test runner, or a deterministic pipeline — open weights is genuinely competitive today, and getting more so. If you’re building anything that requires sustained multi-step reasoning, complex maths, or long-horizon agentic planning, the closed frontier labs still have a meaningful lead. Plan procurement, hosting costs, and product roadmaps around the specific workloads you actually run, not around a composite index that papers over the difference.
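Planning per workload can be as simple as a lookup. The Python sketch below is one hypothetical way to encode it: tag each workload with a benchmark category, supply the lag you believe applies to that category, and set how many months of lag you can tolerate. The function name, the two-month threshold, and the sample lags are illustrative assumptions; only the one-to-two-month coding figure echoes the analysis above.

```python
def plan_by_workload(workloads, lag_months, max_lag_months=2.0):
    """Suggest a model class per workload from per-category lag.

    workloads: dict of workload name -> benchmark category.
    lag_months: dict of category -> months open weights trails closed.
    """
    plan = {}
    for name, category in workloads.items():
        lag = lag_months.get(category)
        if lag is None:
            plan[name] = "evaluate"  # no benchmark evidence for this category
        elif lag <= max_lag_months:
            plan[name] = "open weights"
        else:
            plan[name] = "closed"
    return plan


# Invented lags; replace them with figures for the benchmarks you trust.
lags = {"coding": 1.5, "agentic": 6.0}
jobs = {"test runner": "coding", "planning agent": "agentic", "ocr": "vision"}
print(plan_by_workload(jobs, lags))
```

The threshold is the real decision: it is a statement about how far behind the frontier your product can afford to be, and it will differ by workload.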

The Other Side

To be fair to the optimists: the December projection isn’t wrong, it’s narrow. The AA Intelligence Index is a real measure, the trend is real, and the fact that open weights is converging on it at all would have seemed implausible two years ago. DeepSeek’s release cadence, Mistral’s enterprise push, and Meta’s Llama updates have collectively changed the slope of the curve.

But “we hit parity on one benchmark” is not the same statement as “open source has caught up.” The closed frontier is still moving. New benchmarks are still being introduced. And the five-month average gap — the one number that doesn’t get a press release — hasn’t budged. Optimism is warranted. Triumphalism is not.

🔍 THE BOTTOM LINE

Open weights LLMs are winning on coding, winning on determinism, and closing in on the one composite index that makes the headline. Averaged across all 18 benchmarks, including the harder, fuzzier, more reasoning-heavy ones, the gap to closed frontier models is locked at around five months and shows no sign of closing. The singularity isn’t here. It’s just unevenly distributed across the benchmark suite.

❓ FAQ

Q: Does “open weights” mean “open source”? A: No. Open weights means the model parameters are downloadable: you can run, fine-tune, and inspect them. True open source also requires open training data and open training code. Most releases described as open (Llama, Mistral, DeepSeek) are open weights, not open source.

Q: Why does coding converge faster than reasoning? A: Coding has verifiable ground truth: code either compiles and passes tests, or it doesn’t. That gives training pipelines a clear signal and gives benchmarks a tight feedback loop. Reasoning and maths benchmarks are harder to grade, have fuzzier answers, and resist the same kind of fast iteration.

Q: Is the December 3, 2026 date reliable? A: It’s a linear extrapolation of the current trend on one specific composite index. Treat it as a projection, not a prediction. If a closed lab releases a step-change model, or a major open weights release shifts the slope, the date moves.

Q: Should I switch from closed to open weights based on this? A: Per workload, maybe. For coding tools and deterministic pipelines, yes — the economics and capability now favour open weights. For reasoning-heavy or agentic workloads requiring long-horizon planning, the closed frontier still has a real edge. Don’t replace your stack wholesale; replace the parts where the benchmarks actually say you can.

Q: What about new benchmarks being introduced? A: Expect them. Every time open weights closes the gap on a benchmark, a harder one tends to surface. SWE-bench replaced HumanEval. GPQA replaced MMLU for frontier evaluation. The goalposts move, and the average gap is the only number that survives the move.
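For readers who want to see how a date like December 3, 2026 falls out of a trend line, here is a minimal Python sketch of a straight-line extrapolation. It assumes each camp's best index score is fitted with its own least-squares line and that parity is the point where the two lines meet. That is one plausible reading of a linear projection, not Dborin's actual code, and the function names and all numbers below are invented for illustration.

```python
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Needs at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover_date(open_points, closed_points, origin):
    """Date where the fitted open line meets the fitted closed line.

    Each point is (date, best_score_so_far). Returns None when the open
    line is not rising faster than the closed line.
    """
    def split(points):
        return [(d - origin).days for d, _ in points], [s for _, s in points]

    open_slope, open_icpt = fit_line(*split(open_points))
    closed_slope, closed_icpt = fit_line(*split(closed_points))
    if open_slope <= closed_slope:
        return None  # open is not gaining, so no crossover to project
    days = (closed_icpt - open_icpt) / (open_slope - closed_slope)
    return origin + timedelta(days=round(days))


# Invented scores sampled every 100 days from an arbitrary start date.
start = date(2025, 1, 1)
open_pts = [(start + timedelta(days=100 * i), s) for i, s in enumerate([40, 47, 54])]
closed_pts = [(start + timedelta(days=100 * i), s) for i, s in enumerate([50, 55, 60])]
print(crossover_date(open_pts, closed_pts, start))
```

Nudge either slope slightly and the projected date shifts by months, which is why the answer above says to treat it as a projection rather than a prediction.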

📰 Sources

Doubleword blog: Frontier Open Source LLMs analysis — Jamie Dborin’s projection of the December 2026 AA Index convergence
Artificial Analysis — source data for the 18-benchmark comparison and the Intelligence Index
Hacker News discussion (id 48692966) — community debate on whether benchmark convergence means real capability convergence