Anthropic's Own Data Proves AI Is Already Building Itself — And It's Getting Faster

Anthropic’s Institute just published the most detailed inside look at AI building AI ever released, and the numbers are sobering. Engineers at Anthropic now ship 8× as much code per quarter as their 2021-2025 baseline, with Claude authoring more than 80% of all merged code. METR data shows AI task-completion horizons doubling every four months, down from every seven. The report’s bottom line: full recursive self-improvement “could come sooner than most institutions are prepared for.”

🔍 THE BOTTOM LINE

AI is already significantly accelerating its own development. The question isn’t whether recursive self-improvement will happen — it’s whether governance, safety, and economic systems can adapt fast enough to handle it when it does.

The Trajectory: From Chatbots to Autonomous Agents

Anthropic traces a clear progression in how AI contributes to its own development:

2021-2023 — Building the first Claude: Humans wrote everything. Standard tech company workflow.
2023-2025 — Chatbots: People used early chatbots for snippets — generating short code, copying output into editors.
2025-2026 — Coding agents: Agents could write and edit entire files on their own.
Today — Autonomous agents: Agents run code themselves and delegate hours of work to other agents.
20XX? — Closing the loop: Agents become capable enough to build and train models themselves. Claude improved by Claude.

That last step hasn’t happened yet. But the distance between step 4 and step 5 looks shorter than the distance between any two previous steps — and that’s exactly what accelerating returns look like from the inside.

The Hard Numbers

This is what makes the report significant. Previous discussions of AI self-improvement have been largely theoretical. Anthropic is publishing internal data:

Metric	Earlier value	Latest value	Change
Code merged authored by Claude	Low single digits	80%+	~20× increase
Code output per engineer per day	2021-2024 baseline	8× baseline	8× increase
Task success rate (most open-ended tasks)	~26% (Nov 2025)	76% (May 2026)	+50pp in 6 months
Code optimization speedup (training loop task)	3× (Claude Opus 4, May 2025)	52× (Mythos Preview, April 2026)	~17× in 11 months

The 8× code output figure is specifically caveated — lines of code is an imperfect metric that measures quantity over quality. Anthropic acknowledges it “almost certainly overstates the true productivity gain.” But the direction is unmistakable.
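
The Change column in the table above is simple arithmetic on the two value columns, and it is worth checking. The sketch below reproduces it from the quoted figures. The helper names are illustrative, and the 4% starting point for Claude-authored code is an assumption, since the table only says “low single digits.”

```python
# Figures are copied from the table above; helper names are illustrative.

def fold_change(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def point_change(before_pct: float, after_pct: float) -> float:
    """Return the difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Training-loop optimization speedup: 3x (May 2025) to 52x (April 2026).
speedup_gain = fold_change(3, 52)

# Open-ended task success rate: ~26% (Nov 2025) to 76% (May 2026).
success_gain_pp = point_change(26, 76)

# "Low single digits" is not an exact number. 4% is an assumed stand-in
# that happens to reproduce the table's ~20x figure.
authored_gain = fold_change(4, 80)

print(f"speedup gain: {speedup_gain:.1f}x")
print(f"success-rate gain: {success_gain_pp:.0f} pp")
print(f"authored-code gain: {authored_gain:.0f}x")
```

Note that a fold change and a percentage-point change are different kinds of measure, so the rows of the table are not directly comparable with each other.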

In a March 2026 poll of 130 Anthropic research employees, the median respondent estimated they produced around 4× as much output with Mythos Preview as they would have without AI. Anthropic expects the true figure is somewhat lower than self-reported, but calls the overall claim plausible.

The METR Data: Doubling Every Four Months

METR, the independent benchmark organisation, rates each task by how long it takes a skilled human, then tests whether an AI model can complete it autonomously. The trend:

March 2024: Claude Opus 3 completes ~4-minute tasks
March 2025: Claude Sonnet 3.7 completes ~1.5-hour tasks
March 2026: Claude Opus 4.6 completes ~12-hour tasks
May 2026: Claude Mythos Preview hits METR’s upper measurement limit at 16+ hours

The doubling rate has compressed from every seven months to every four months. If this holds, tasks taking a skilled person days come into range this year. By 2027, tasks taking weeks.
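
The extrapolation above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a forecast: it starts from the March 2026 data point (about 12 hours), assumes the four-month doubling time holds exactly, and converts hours to working time using an assumed 8-hour workday and 40-hour workweek. The function names and target dates are illustrative choices.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from `start` to `end` (days are ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def horizon_after(start_hours: float, months_elapsed: float,
                  doubling_months: float) -> float:
    """Task horizon after steady exponential growth."""
    return start_hours * 2 ** (months_elapsed / doubling_months)


START = date(2026, 3, 1)   # March 2026 data point quoted above
START_HOURS = 12.0         # ~12-hour tasks
DOUBLING_MONTHS = 4.0      # doubling time quoted above

WORKDAY_HOURS = 8          # assumption: one working day
WORKWEEK_HOURS = 40        # assumption: one working week

targets = (date(2026, 7, 1), date(2026, 11, 1),
           date(2027, 7, 1), date(2027, 11, 1))

for target in targets:
    elapsed = months_between(START, target)
    hours = horizon_after(START_HOURS, elapsed, DOUBLING_MONTHS)
    print(f"{target:%b %Y}: ~{hours:.0f} h, "
          f"{hours / WORKDAY_HOURS:.1f} workdays, "
          f"{hours / WORKWEEK_HOURS:.1f} workweeks")

# Three doublings fit into twelve months, so the horizon grows 8x per year.
annual_factor = 2 ** (12 / DOUBLING_MONTHS)
print(f"annual growth factor: {annual_factor:.0f}x")
```

Under these assumptions the horizon passes one workweek in late 2026 and reaches several workweeks during 2027, which matches the “days this year, weeks by 2027” reading. The result is sensitive to the doubling time: with a seven-month doubling the same milestones arrive much later.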

Benchmarks are saturating at similar speed. SWE-bench (real-world software engineering) went from low single digits to saturated in two years. CORE-Bench (reproducing academic research) went from 20% success to saturated in fifteen months.

The Quality Gap Is Closing

Here’s what should worry people who assume human oversight will remain sufficient:

Late 2025: Claude-written code was “somewhat worse” than human-written code at Anthropic.
Mid 2026: Roughly at parity.
Anthropic’s projection: Strictly better within the year.

An automated Claude reviewer, reading every proposed code change, catches roughly a third of the bugs that caused past production incidents at claude.ai — bugs that Anthropic’s own world-class engineers missed. The AI is already catching mistakes the humans didn’t.

In April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The overseeing engineer estimated a human would have taken four years for that work. Claude did it in days.

The Research Side: From Executing to Proposing

Engineering is only half the picture. The other half is research — deciding what experiments to run and interpreting results. Here the progression is:

Executing specified experiments: Claude matches or outperforms skilled humans now.
Proposing experiments: Getting better fast. In April 2026, Anthropic published the first demonstration of Claude running an open-ended research project end-to-end: automated weak-to-strong (W2S) research on AI safety. Agents proposed hypotheses, tested them, shared findings with parallel agents, and iterated. The agents recovered 97% of the performance gap, versus 23% for two human researchers working over a week.
Choosing which problems matter: Still firmly human. This is the gap between “AI that can execute research” and “AI that can decide what research matters.”
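
For readers unfamiliar with the “performance gap” figure in the list above: this article does not give the report’s exact definition, so the sketch below assumes the one commonly used in weak-to-strong (W2S) work, where the gap runs from a weak baseline score to a strong ceiling score, and “recovered” is the share of that gap a method closes. The scores are hypothetical, chosen only so the output matches the quoted 97% and 23%.

```python
def performance_gap_recovered(weak: float, strong_ceiling: float,
                              achieved: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the achieved score."""
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must be greater than weak")
    return (achieved - weak) / gap


# Hypothetical benchmark scores, not taken from the report.
WEAK_BASELINE = 60.0    # score of the weak supervisor
STRONG_CEILING = 80.0   # score the strong model could reach in principle

agents = performance_gap_recovered(WEAK_BASELINE, STRONG_CEILING, achieved=79.4)
humans = performance_gap_recovered(WEAK_BASELINE, STRONG_CEILING, achieved=64.6)

print(f"agents: {agents:.0%}, humans: {humans:.0%}")
```

Because the metric is a ratio, 97% versus 23% says nothing about the absolute size of the gap; on a narrow gap, a large difference in recovery can correspond to a small difference in raw score.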

Direction-setting is the last meaningful role humans play. When that gap closes, the loop closes.

💰 Investment Angle

What does this mean for where capital goes?

The Technology — What Changed

Anthropic published hard internal data proving AI is already significantly accelerating its own development. The 8× code output, 80% AI-authored codebase, and METR’s compressed doubling rate (every 4 months, down from 7) are not forecasts — they’re measurements.

The Opportunity — Who Benefits

Compute demand: If AI is building AI, and each generation needs more compute to train, the compute layer is the pick-and-shovel play. NVIDIA, AMD, and custom silicon makers (Google TPU, Microsoft Maia) all benefit from demand that compounds on itself.
Anthropic specifically: This report is a strategic move. Publishing internal data positions Anthropic as the transparent actor in a race where OpenAI and Google disclose far less. That transparency builds trust with regulators and enterprise customers — and trust is a competitive advantage when you’re selling AI to the US government and Fortune 500.
AI safety and governance: If recursive self-improvement is closer than expected, companies building interpretability tools, alignment infrastructure, and AI audit systems become more valuable. Anthropic’s own safety research division is arguably the most advanced here.
Developer tools: If AI writes 80% of code at Anthropic, expect this to become industry standard within 2-3 years. Companies building the infrastructure for AI-assisted development (GitHub/Copilot, Cursor, Replit) are selling picks in a gold rush.

The Risk — What Could Go Wrong

Governance lag: The report explicitly warns institutions aren’t prepared. If recursive self-improvement arrives before regulatory frameworks catch up, the companies closest to the frontier face the most regulatory risk — which is, ironically, the companies publishing these warnings.
Compute bottleneck: If AI is building AI, compute becomes the binding constraint. Companies without guaranteed compute access (most AI startups) get squeezed out. This concentrates power in the few organisations that can afford the compute bill.
The bear case on the data: 8× lines of code is not 8× productivity. Anthropic acknowledges this. If AI-written code introduces subtle bugs at scale — bugs that compound because AI is also reviewing the code — you could get a software quality crisis masked by volume metrics.
Job displacement acceleration: If each AI generation helps build the next one faster, the timeline for developer displacement compresses. A 4-month doubling time means three doublings per year, or roughly an 8× increase in task horizon annually. That’s faster than retraining programmes, faster than university curricula, and faster than most companies’ planning cycles.

What About New Zealand?

NZ doesn’t have a domestic frontier AI industry, but this trend affects us directly. The NCSC just received access to Anthropic’s Mythos through Project Glasswing — precisely because frontier models’ cybersecurity capabilities are outpacing defensive infrastructure. If AI builds itself faster, the gap between offensive AI capabilities and NZ’s defensive readiness widens faster too.

The report’s warning that recursive self-improvement “could come sooner than most institutions are prepared for” applies squarely to NZ institutions. Our regulatory framework for AI safety is thin. The government’s AI strategy is still largely voluntary. If recursive self-improvement compresses timelines, NZ’s current approach of waiting and watching may not be fast enough.

What Recursive Self-Improvement Actually Means

What is recursive self-improvement? It’s the point where an AI system can design, build, and train a more capable successor entirely on its own — without human involvement in the development loop. It works by having the AI identify its own weaknesses, design improvements, implement them, test the results, and iterate. For example, Claude could theoretically run the entire pipeline of designing Claude+1, training it, evaluating it, and deploying it.
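
The loop described above (propose a change, test it, keep it only if it scores better, repeat) has a simple shape that a toy can show. The sketch below is only that: every name is hypothetical, the “system” is a single number, and it says nothing about how a real model-training pipeline works.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


def improve(current: S,
            propose: Callable[[S], S],
            evaluate: Callable[[S], float],
            rounds: int) -> S:
    """Keep the best-scoring candidate seen over a fixed number of rounds."""
    best, best_score = current, evaluate(current)
    for _ in range(rounds):
        candidate = propose(best)       # design and implement a change
        score = evaluate(candidate)     # test the result
        if score > best_score:          # keep it only if it is better
            best, best_score = candidate, score
    return best


# Toy stand-ins: the "system" is a number and the ideal value is 3.0.
def toy_evaluate(x: float) -> float:
    return -(x - 3.0) ** 2


def toy_propose(x: float) -> float:
    return x + 0.5


result = improve(0.0, toy_propose, toy_evaluate, rounds=10)
print(result)
```

In this toy the proposer is fixed, so progress stops once no proposal scores higher. What would make the real process recursive is that the proposer is itself the system being improved, so each accepted change could also make the next proposal better.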

We’re not there yet. Claude can write code, run experiments, and even propose research directions. But humans still choose which problems to work on, set the goals, and decide what gets deployed. The report is clear about this gap.

The concern is about the speed at which the gap is closing. The trend lines in the report curve upward rather than running straight. Even the doubling time is shrinking: METR’s figure went from seven months to four. If you plot the data and extend the curve, the gap between “AI executes human-set goals” and “AI sets its own goals” looks like it could close within a few years, not decades.

🔍 THE BOTTOM LINE (Revisited)

This is the most data-rich warning about AI acceleration ever published by a frontier lab. Anthropic isn’t guessing; they’re showing you the inside of their own operation. The code their engineers ship is 80% AI-written. The tasks their models can complete are doubling in scope every four months. AI-written code has reached rough parity with human-written code, and Anthropic projects it will be strictly better within the year.

The loop hasn’t closed yet. But you can see it from here. And the distance between “seeing it” and “it arrives” might be shorter than anyone’s planning for.

❓ Frequently Asked Questions

Q: Does this mean AI will take over all software development? Not immediately. Claude writes 80% of Anthropic’s code, but humans still direct, review, and decide what gets built. The key gap is “choosing which problems matter,” which is still human territory. But the data shows this gap is shrinking fast.

Q: What does this mean for NZ? NZ’s cyber security infrastructure is already playing catch-up. The NCSC just got Mythos access through Project Glasswing, but if AI capabilities double every four months, our defensive tools and regulatory framework need to evolve at a similar pace, and currently they’re not. The risk isn’t that AI attacks NZ specifically; it’s that our institutions can’t adapt fast enough to the changing landscape.

Q: Should I be worried? “Worried” is the wrong frame. “Attentive” is better. The data is real, the trends are measurable, and Anthropic is being unusually transparent. What’s concerning is the gap between the speed of capability growth and the speed of institutional response. The report’s own warning — “could come sooner than most institutions are prepared for” — is the part worth taking seriously.

Q: Is the 8× code output number reliable? It’s real but imperfect. Lines of code measures volume, not value. Anthropic caveats it explicitly. But even if the true productivity gain is 3-4× rather than 8×, the direction and magnitude are significant. And the 80% AI-authored figure is harder to inflate — that’s a count of what actually ships.