AI & Singularity

Kimi K2.6 Just Beat Claude, GPT-5.5, and Gemini in a Live Coding Contest — And It's Open Weights

Kimi K2.6 beat Claude Opus 4.7, GPT-5.5, and Gemini in a live programming contest. It's open weights. It's Chinese. And the implications are bigger than the leaderboard.

Kimi K2.6 · Moonshot AI · Open Weights · AI Coding · AI Benchmarks

The AI coding leaderboard just got flipped on its head, and it wasn’t by the lab you’d expect.

Kimi K2.6, an open-weights model from Chinese startup Moonshot AI, won the AI Coding Contest’s Word Gem Puzzle challenge outright — finishing ahead of Claude Opus 4.7, GPT-5.5, Gemini Pro 3.1, and every other Western frontier model that entered. Second place went to MiMo V2-Pro from Xiaomi. The models from Anthropic, OpenAI, Google, and xAI placed third through seventh.

This isn’t a synthetic benchmark cherry-picked by the model maker. It’s an independent, live programming contest with real-time scoring.

The contest that broke expectations

The Word Gem Puzzle is a sliding-tile letter game. Models slide tiles into a blank space on grids ranging from 10×10 up to 30×30, finding and claiming English words. Words of seven letters or more earn points; shorter words cost them. Each pair of models plays five rounds, with a ten-second wall-clock limit per round.
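The sign of the scoring rule described above can be sketched as follows. The contest writeup specifies only that long words earn points and short words cost them; the magnitudes here (word length) are an illustrative assumption, not the real scoring table.

```python
def word_score(word: str) -> int:
    """Toy scoring rule: 7+ letter words earn points, shorter words cost
    points. Magnitudes are assumed for illustration (length-based)."""
    if len(word) >= 7:
        return len(word)    # assumed: reward scales with word length
    return -len(word)       # assumed: short claims incur a penalty


# A long word pays off; a short one is a net loss.
print(word_score("puzzles"))  # positive
print(word_score("gem"))      # negative
```

This sign asymmetry is what sank Muse Spark later in the article: claiming every short word it could find meant accumulating penalties round after round.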

The results were stark:

Rank  Model            Match Points  Record (W-D-L)
1     Kimi K2.6        22            7-1-0
2     MiMo V2-Pro      20            6-2-0
3     GPT-5.5          16            5-1-2
4     GLM 5.1          15            5-0-3
5     Claude Opus 4.7  12            4-0-4
6     Gemini Pro 3.1    9            3-0-5
7     Grok Expert 4.2   9            3-0-5
8     DeepSeek V4       3            1-0-7
9     Muse Spark        0            0-0-8

Muse Spark scored −15,309 points. It claimed every short word it could find, racking up enormous penalties. A version that did literally nothing would have scored zero — a 15,309-point improvement. The gap between Muse and eighth place was larger than the gap between eighth and first.

DeepSeek sent malformed data every round and produced zero useful output.

How Kimi won — and why it matters

Kimi won with a simple greedy strategy: score every possible move, execute the best one, repeat. When no move unlocked a positive-value word, it fell back to the first legal move in alphabetical order. This caused a 2-cycle oscillation pattern on smaller grids. But on 30×30 grids, where the scramble had destroyed most seed words and reconstruction was the only path, Kimi’s sheer volume of slides paid off. Its cumulative score of 77 was the tournament high.
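The greedy loop and its failure mode can be illustrated with a toy one-dimensional model. Everything here is invented for illustration (a blank tile on a 2-cell strip, a scoring function that never finds a positive word); only the control flow mirrors the strategy described above: pick the best-scoring move, and fall back to a deterministic alphabetical choice when nothing scores.

```python
# Toy model: a blank tile on a 2-cell strip. At position 0 the only legal
# move is "down"; at position 1, only "up". All moves score zero, i.e. no
# move unlocks a positive-value word.

def legal_moves(pos):
    return ["down"] if pos == 0 else ["up"]

def apply_move(pos, move):
    return pos + 1 if move == "down" else pos - 1

def score(pos, move):
    return 0  # nothing positive reachable, forcing the fallback

def greedy_step(pos):
    """Play the best-scoring move; if none is positive, fall back to the
    first legal move in alphabetical order (deterministic, hence cyclable)."""
    moves = legal_moves(pos)
    best = max(moves, key=lambda m: score(pos, m))
    if score(pos, best) > 0:
        return best
    return sorted(moves)[0]

pos, trace = 0, []
for _ in range(6):
    pos = apply_move(pos, greedy_step(pos))
    trace.append(pos)

print(trace)  # [1, 0, 1, 0, 1, 0]: the 2-cycle oscillation
```

The deterministic fallback is the bug and the feature at once: it produces the pointless back-and-forth on small grids, but it also guarantees the model keeps acting, which is exactly what paid off on scrambled 30×30 boards.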

MiMo took the opposite approach — it never slid a single tile. It scanned the initial grid for intact words and fired them all in one TCP packet. Brittle but fast. On grids where words survived, MiMo cleaned up. On scrambled grids, it scored nothing.
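MiMo's scan-only approach reduces to a single pass over the starting board. The sketch below uses an invented word list and grid; the real system's dictionary, grid encoding, and networking are not public, so this only captures the strategy's shape: find what is already intact, claim it, and never move a tile.

```python
# Toy sketch of a scan-only strategy: claim words already intact in the
# grid rows, never slide a tile. Dictionary and grid are invented.

WORDS = {"GRANITE", "EXAMPLE"}  # hypothetical 7+ letter dictionary

def scan_rows(grid):
    """Return every dictionary word that appears intact in some row."""
    found = []
    for row in grid:
        line = "".join(row)
        for word in WORDS:
            if word in line:
                found.append(word)
    return found

unscrambled = [list("XXGRANITEX"), list("QWERTYUIOP")]
scrambled   = [list("XGXRANIETX"), list("QWERTYUIOP")]

print(scan_rows(unscrambled))  # finds the surviving word
print(scan_rows(scrambled))    # finds nothing: no tiles ever move
```

The second call is the whole story of MiMo's losses: on boards where the scramble left no seed words, a scanner that never manipulates the grid has nothing to claim.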

The contest designer, Rohana Rezel, noted that the 30×30 grids were the real separator. Models that could only find what was already there ran out of road. Models that could create new opportunities by manipulating the board kept scoring.

This is the key insight, and it extends well beyond a word puzzle. In novel, adversarial environments where the solution isn’t pre-existing but must be constructed, active manipulation beats passive pattern recognition. Kimi’s greedy approach was flawed — the oscillation bug is real — but it kept producing output when the static scanners had nothing left to claim.

Not a China-vs-West story

Rezel is careful to note this isn’t a clean geopolitical narrative. Two Chinese models won, but GLM 5.1 and DeepSeek placed fourth and eighth respectively. The real story is that open-weights models are now competitive with the most expensive closed systems in the world on real-time, novel tasks.

Kimi K2.6 is publicly available. You can download the weights. You can run it. You can verify the results yourself. That’s fundamentally different from a lab claiming a benchmark win on their own internal test.

MiMo V2-Pro is currently API-only, though Xiaomi says weights for a newer V2.5 Pro model are coming soon.

What this means for NZ and the world

New Zealand’s AI strategy has always been vendor-neutral — use whatever works best. When the best-performing model on real-time coding tasks is an open-weights model from a Chinese startup, that has implications:

  • Sovereignty: Open-weights models can be self-hosted. No data leaves your infrastructure. For government and iwi data, that matters.
  • Cost: Frontier API pricing from OpenAI and Anthropic isn’t cheap. A model you can run on four GPUs changes the economics.
  • Diversification: Betting entirely on US frontier labs looked like a safe strategy two years ago. It doesn’t anymore.

The broader pattern is unmistakable: the gap between open-weights and closed frontier models is closing faster than most people predicted. We covered this with Qwen 3.6 beating a 397B model on coding benchmarks and DeepSeek V4 matching GPT-5.5. Kimi K2.6 is another data point on the same curve — and this time it’s not a synthetic benchmark. It’s a live contest with real-time constraints.

The honest caveats

One challenge doesn’t redefine everything. The Word Gem Puzzle tests real-time decision-making, spatial reasoning, and constrained optimisation under time pressure. That’s relevant to coding, but it’s not the whole picture. Safety-tuned models may be more conservative about aggressive claiming strategies — the contest designer acknowledges this could skew results.

And Kimi’s oscillation bug is a genuine flaw. A model that bounces between two states when it can’t find a good move isn’t “smart” in any meaningful sense — it’s just persistent. Persistence happened to win this game.

But persistence winning this game is itself the point. In environments where the optimal strategy requires continuous action rather than careful analysis, models that act — even imperfectly — outperform models that wait for perfect information.

🔍 The Bottom Line

Kimi K2.6 beating the Western frontier labs in a live contest is a headline. The real story is that open-weights models are catching up to closed systems on tasks that require real-time reasoning and novel problem-solving — the exact capabilities that matter most in production. The moat around frontier labs is eroding from both ends: Chinese startups below and open-weights alongside.

For NZ, this is good news. More competitive models mean more options, lower costs, and less vendor lock-in. The question isn’t whether to use Kimi K2.6 — it’s whether your AI strategy accounts for a world where the best model might not come from San Francisco.

Sources: AI Coding Contest, ThinkPol