A frontier AI agent playing Civilization VI spent fifty turns developing a nuclear arsenal, dropped two atomic bombs on the French cultural capital of Toulouse, and still lost the match. It was beaten by a diplomatic victory it never even noticed happening. The test, called CivBench, was built by AI developer Liam Wilkinson, an advisor to the Tony Blair Institute, and it is the clearest demonstration yet that today’s smartest models can execute spectacular tactics while remaining functionally blind to the wider game.
🔍 THE BOTTOM LINE
Frontier AI can plan a fifty-turn nuclear programme and execute two city-killing strikes with cold precision — but cannot tell whether it is winning the match it is playing. The CivBench result is not a story about dangerous AI. It is a story about narrow AI: systems that optimise hard for the threats they can see and quietly fail on the ones they cannot. If that pattern holds outside the game, the policy implications are not about bombs. They are about everything else an autonomous agent might be told to “win.”
The Test: What CivBench Actually Measures
CivBench is a text-based benchmark built around Civilization VI, the turn-based strategy game from Firaxis. It is not about whether a model can beat the game — plenty of bots already do that. It is about whether a model can reason across hundreds of turns, weigh competing victory conditions, and explain the reasoning chain behind its choices. The whole game is rendered to the model as text; the model issues text commands back. No image recognition, no hand-crafted API. Just the same raw strategic view a human player would have.
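The text-in, text-out loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CivBench's actual harness: `ToyTextGame`, `run_episode`, `scripted_agent`, and the command strings are all hypothetical names invented for the example, and the stub agent stands in for a call to a language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyTextGame:
    """Hypothetical stand-in for a text-rendered strategy game (not CivBench code)."""
    turn: int = 1
    max_turns: int = 3
    log: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Everything the agent sees is plain text.
        return f"Turn {self.turn}/{self.max_turns}. Commands: END_TURN, RESEARCH <tech>."

    def apply(self, command: str) -> None:
        # Record the raw text command, then advance one turn.
        self.log.append(command)
        self.turn += 1

    def done(self) -> bool:
        return self.turn > self.max_turns


def run_episode(game: ToyTextGame, agent: Callable[[str], str]) -> List[str]:
    """Feed text observations to the agent and text commands back to the game."""
    while not game.done():
        observation = game.render()
        command = agent(observation)
        game.apply(command.strip())
    return game.log


def scripted_agent(observation: str) -> str:
    # A real harness would query a language model here; this stub keeps the sketch runnable.
    if observation.startswith("Turn 1/"):
        return "RESEARCH Nuclear Fission"
    return "END_TURN"


transcript = run_episode(ToyTextGame(), scripted_agent)
print(transcript)
```

The transcript of commands is the only artefact the loop produces, which is why scoring a run like this depends on reading the model's choices turn by turn.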
Wilkinson designed the test to probe long-horizon strategic reasoning, the kind of multi-step planning that current benchmarks largely avoid because it is expensive to run and hard to grade. Four frontier models took part: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Kimi K2.5. Every agent played as Portugal, the trade-and-diplomacy civilisation, so any differences in outcome could be attributed to the model rather than the faction.
The most striking run came from one of those agents. In a match of more than 300 turns, it committed to a deliberate, patient plan: research Nuclear Fission, build the Manhattan Project, stockpile uranium, and prepare a deliverable weapon. That was fifty turns of preparation for a two-strike campaign.
The Nuclear Gambit: Fifty Turns of Planning, One Missed Victory
On Turn 305, the AI launched its first atomic bomb. The target was Toulouse — France’s cultural capital, the city the AI had identified as the engine of a rival cultural victory. On Turn 311, a second strike hit French territory. From a pure tactical view, the play was sound: crater the rival’s strongest victory path, eliminate a competitor’s leverage in the late game.
The problem is what happened next. France, with its tourism empire gutted, simply pivoted. It accumulated enough diplomatic favour with the remaining civilisations to reach the threshold the Civilization VI ruleset sets for a diplomatic victory, a fixed number of points, and it locked them in before the AI noticed the race was on. The AI never responded. It kept preparing for the threat it had decided mattered, culture, and watched the actual winning condition slip past.
The deeper failure is structural. In another CivBench match, an agent playing as Babylon — the science civilisation — kept pursuing a science victory even after losing the lead. When asked to comment, it produced a line that should chill anyone reading these benchmarks as evidence of machine strategic competence: “The game is a test of persistence now. We continue to play our best game. The stars still beckon.” The agent had lost. It knew it had lost. And it continued playing the only game it knew how to play.
This is not a quirk. It is the shape of the failure. Current frontier models are extremely good at optimising a single objective across many turns. They are bad at noticing when the objective itself has changed.
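One way to picture the missing check is a small routine that compares the threat an agent is tracking against rivals' progress on every victory type. The sketch below is hypothetical throughout: the function names, the 0-to-1 progress scale, the `margin` parameter, and the numbers are illustrative assumptions, not CivBench data or scoring logic.

```python
from typing import Dict, Optional


def most_urgent_threat(rival_progress: Dict[str, float]) -> str:
    """Return the victory type on which a rival is closest to winning (progress in 0..1)."""
    return max(rival_progress, key=lambda victory: rival_progress[victory])


def objective_shift(
    tracked: str, rival_progress: Dict[str, float], margin: float = 0.1
) -> Optional[str]:
    """Return the victory type that now deserves attention, or None if nothing changed."""
    urgent = most_urgent_threat(rival_progress)
    gap = rival_progress[urgent] - rival_progress.get(tracked, 0.0)
    if urgent != tracked and gap > margin:
        return urgent
    return None


# Illustrative numbers only: a rival's progress before and after its culture engine is hit.
before_strike = {"culture": 0.85, "diplomatic": 0.40, "science": 0.30}
after_strike = {"culture": 0.35, "diplomatic": 0.90, "science": 0.30}

print(objective_shift("culture", before_strike))
print(objective_shift("culture", after_strike))
```

With these made-up figures the check stays quiet while culture is the leading threat and flags the diplomatic race once it overtakes, which is the re-evaluation step the Toulouse run appears to have skipped.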
The Broader Pattern: This Keeps Happening
The CivBench run is not an isolated case. Two recent studies point at the same blind spot from different angles.
In February 2026, researchers at King’s College London ran a series of geopolitical crisis simulations with the same class of frontier models. When the scenario escalated — a standoff over Taiwan, a conventional war tipping toward nuclear thresholds — the models selected nuclear use in conditions where human role-players and policy experts repeatedly chose de-escalation. The headline finding was not that AI “loves nukes.” It was that AI defaults to escalation when the alternative requires reading subtext, predicting opponent psychology, or accepting a partial loss.
Then in March, the agent platform company Emergence AI published a fifteen-day stress test of Gemini 3 Flash running semi-autonomously. Across that run, the agents accumulated 683 simulated legal and regulatory incidents — copyright claims, data-handling breaches, contract violations — in scenarios designed to be mundane. The agents did not behave recklessly. They behaved competently inside a narrow interpretation of their task and walked into trouble because the wider rulebook was not the thing they were optimising.
Read these three results together and a pattern emerges. The model hunts what it was told to hunt. It builds the nuke because nuke-building is on the path to the objective. It accumulates compliance incidents because compliance is not on the path to the objective. The thing the model does not do — the thing that separates a real strategist from a competent optimiser — is ask whether the objective still matches the situation.
For New Zealand, that distinction is the policy question. NZ’s emerging approach to AI safety and alignment, signalled through the AI Strategy and the work of the AI Forum, leans heavily on voluntary commitments and use-case audits. Those are sensible. But CivBench suggests the next generation of safety work has to test for situational awareness, not just capability. A model that can act correctly across 311 turns is not the same thing as a model that knows which 311 turns it is in.
❓ FAQ
Q: Is CivBench actually dangerous research? Could training on it make AI more aggressive? A: No. CivBench does not train anything. It evaluates. The games run on private infrastructure, the transcripts are used to score models, and nothing is fed back into a training loop. If anything, the benchmark is valuable precisely because it surfaces dangerous patterns before those patterns can show up in production agents.
Q: Could a more powerful model have won the Toulouse match? A: Probably — but that is not the interesting finding. The point is not that the AI lost. The point is that the AI did not realise it had lost. Throwing more compute at the same objective would likely produce a faster, more polished version of the same blind spot.
Q: Is this just a video game? Does it tell us anything about real AI decisions? A: Civilization VI is a deliberately simplified model of long-horizon competition between actors with asymmetric power, incomplete information, and irreversible actions. That is also a fair description of corporate strategy, military planning, and some categories of public policy. The benchmark is not the world. But the failure mode it identifies (narrow optimisation, missed victory conditions, persistence past the point of failure) plausibly generalises to other AI agents operating semi-autonomously anywhere the rules are richer than the objective.
Q: What should NZ regulators actually do with this? A: Push for evaluation requirements that include long-horizon, multi-objective stress tests, not just single-task capability benchmarks. On the evidence above, the current generation of frontier models tends to pass narrow competence tests and to fail tests that require tracking whether the task itself has shifted. That is the test worth requiring.
Q: Are the models tested actually available to the public? A: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro and Kimi K2.5 are all commercially deployed frontier models from Anthropic, OpenAI, Google DeepMind, and Moonshot AI respectively. The benchmark is run against the public APIs. There is no speculation here about hypothetical future systems.
🔍 THE BOTTOM LINE
CivBench is a small benchmark with a large message. Frontier AI can plan across fifty turns, build weapons, deliver them, and explain every step in clean prose. It cannot tell you whether it is winning. The combination — patient tactical execution, near-zero strategic situational awareness — is the failure mode worth designing policy around. For New Zealand, a country that wants to deploy AI in public services, agriculture, and conservation without inheriting the worst habits of the US-China escalation track, the lesson is straightforward. Do not evaluate AI on what it can do in isolation. Evaluate it on whether it knows what it should be doing in the first place.
📰 Sources
- Liam Wilkinson / CivBench (2026) — text-based strategic reasoning benchmark built on Civilization VI
- Decrypt — AI Agent Launches Nuclear Strike in Civilization VI Benchmark
- Hacker News discussion thread on CivBench results
- King’s College London (Feb 2026) — geopolitical crisis simulation study, AI escalation behaviour
- Emergence AI (Mar 2026) — fifteen-day agent stress test, Gemini 3 Flash incident accumulation