Breaking News

Microsoft's MDASH Found 16 Windows Flaws Humans Missed — and Beat Anthropic's Mythos Doing It

Microsoft's MDASH found 16 Windows vulnerabilities including 4 critical RCEs, beat Mythos on the CyberGym benchmark, and proved that multi-model AI defense is production-ready.

Microsoft · MDASH · AI cybersecurity · Mythos · Anthropic

Microsoft just weaponised AI for defense — and it found bugs that eluded every human auditor in Windows’ history

Microsoft has unveiled MDASH (Multi-model Defense Agent System for Hardening), a multi-model agentic security system that discovered 16 previously unknown Windows vulnerabilities — including 4 critical remote code execution flaws in the Windows kernel TCP/IP stack and IKEv2 service. It also scored 88.45% on the CyberGym benchmark, roughly 5 points ahead of the runner-up, a submission that includes Anthropic’s Mythos. The cybersecurity AI arms race now has three serious players, and defense just took the lead.

🔍 THE BOTTOM LINE

AI vulnerability discovery has graduated from research toy to production weapon — and the winner isn’t the smartest model, it’s the best system around it.


What is MDASH?

MDASH is Microsoft’s multi-model agentic scanning harness, built by the company’s Autonomous Code Security (ACS) team. Unlike single-model approaches that ask one AI to find bugs, MDASH orchestrates more than 100 specialised AI agents across an ensemble of frontier and distilled models. These agents don’t just find potential vulnerabilities — they debate, validate, deduplicate, and prove them end-to-end.

Think of it as an AI red team that argues with itself before reporting findings, and then builds working proof-of-concept exploits to confirm each one. The pipeline works in five stages:

  1. Prepare — Ingests source code, builds language-aware indices, and draws attack surface models from past commits
  2. Scan — Runs specialised auditor agents over candidate code paths, emitting findings with hypotheses
  3. Validate — A second cohort of “debater” agents argues for and against each finding’s exploitability
  4. Dedup — Collapses semantically equivalent findings
  5. Prove — Constructs and executes triggering inputs to confirm the bug exists

That last stage is the kicker. MDASH doesn’t just say “this looks suspicious” — it builds the exploit to prove it. No false positives on the test driver. Zero.
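The five-stage flow is easy to picture as an orchestration loop. Here’s a minimal sketch — all names and stage stubs are hypothetical, not Microsoft’s actual API; the real debaters and PoC builders are LLM agents, stubbed out here as plain callables:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    path: str          # code path the auditor flagged
    hypothesis: str    # what the agent thinks is wrong

def prepare(source_tree: dict[str, str]) -> list[str]:
    # Stage 1: index candidate code paths (here: just the file names).
    return sorted(source_tree)

def scan(paths: list[str]) -> list[Finding]:
    # Stage 2: auditor agents emit one hypothesis per path (stubbed).
    return [Finding(p, f"unchecked length in {p}") for p in paths]

def validate(findings: list[Finding], refute) -> list[Finding]:
    # Stage 3: debater agents try to refute each finding; survivors pass.
    return [f for f in findings if not refute(f)]

def dedup(findings: list[Finding]) -> list[Finding]:
    # Stage 4: collapse semantically equivalent findings (here: by path).
    return list({f.path: f for f in findings}.values())

def prove(findings: list[Finding], trigger) -> list[Finding]:
    # Stage 5: only findings with a working triggering input get reported.
    return [f for f in findings if trigger(f)]

def run_pipeline(source_tree, refute, trigger):
    return prove(dedup(validate(scan(prepare(source_tree)), refute)), trigger)

# Toy run: two files; the debater refutes one finding, the PoC proves the other.
tree = {"tcpip.sys": "...", "ikeext.dll": "..."}
out = run_pipeline(
    tree,
    refute=lambda f: f.path == "ikeext.dll",  # debater knocks this one out
    trigger=lambda f: True,                   # assume the PoC fires
)
print([f.path for f in out])  # ['tcpip.sys']
```

The shape is the point: nothing reaches the report unless it survives a refutation attempt *and* a working trigger — which is how you get to zero false positives.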


The Numbers That Matter

  • 16 new vulnerabilities found across the Windows networking and authentication stacks, including 4 critical RCEs in the Windows kernel TCP/IP stack and IKEv2 service
  • 21 out of 21 planted vulnerabilities found with zero false positives on a private test driver
  • 96% recall against 5 years of confirmed MSRC cases in clfs.sys; 100% in tcpip.sys
  • 88.45% on CyberGym — the public benchmark of 1,507 real-world vulnerabilities, top of the leaderboard, roughly 5 points ahead of the runner-up (which includes Anthropic’s Mythos)

Why This Beats the “Best Model” Approach

Here’s the thing that makes MDASH interesting beyond the benchmarks: the model isn’t the product. The system is.

Microsoft’s own blog makes this point explicitly: “The durable advantage lies in the agentic system around the model rather than any single model itself.” That’s not marketing — it’s architecture.

MDASH runs three tiers of models:

  • A SOTA (state-of-the-art) model as the heavy reasoner
  • Distilled models as cost-effective debaters for high-volume passes
  • A second separate SOTA model as an independent counterpoint

When the auditor flags something and the debater can’t refute it, that disagreement itself becomes a signal. The finding’s credibility goes up. This is adversarial collaboration at AI speed — something no single model, however clever, can replicate on its own.
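A toy version of that scoring rule makes the idea concrete — the weights below are purely illustrative, since Microsoft hasn’t published its calibration:

```python
def credibility(auditor_conf: float, debater_refuted: bool) -> float:
    """Score a finding after the debate round.

    A flag that survives a genuine refutation attempt is worth MORE than
    the auditor's raw confidence; a refuted one is heavily discounted.
    The 1.5x boost and 0.2x penalty are illustrative, not Microsoft's.
    """
    if debater_refuted:
        return auditor_conf * 0.2            # counterargument found
    return min(1.0, auditor_conf * 1.5)      # disagreement survived: boost

print(credibility(0.5, debater_refuted=False))  # 0.75 — survived the debate
print(credibility(0.5, debater_refuted=True))   # 0.1  — likely false positive
```

The asymmetry is what makes two cheap models worth more than one expensive one: the debate itself generates information neither model had alone.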

And it’s portable. When a new model drops, you flip a config toggle and A/B test it. Your plugins, scope files, and calibrations carry over. You ride the frontier without rebuilding the pipeline.
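In config terms, that portability might look something like this — a hypothetical sketch, with made-up model and plugin names; Microsoft hasn’t published MDASH’s actual configuration format:

```python
# Hypothetical tier config: three model roles plus domain plugins.
PIPELINE_CONFIG = {
    "heavy_reasoner": "frontier-model-a",                 # SOTA tier
    "debaters": ["distilled-small-1", "distilled-small-2"],  # cheap, high-volume
    "counterpoint": "frontier-model-b",                   # independent second SOTA
    "plugins": ["win_kernel_callconv", "irp_invariants"], # carry over unchanged
}

def swap_model(config: dict, role: str, new_model: str) -> dict:
    """Swap one tier for an A/B test; everything else rides along."""
    updated = dict(config)
    updated[role] = new_model
    return updated

ab_test = swap_model(PIPELINE_CONFIG, "heavy_reasoner", "frontier-model-c")
print(ab_test["heavy_reasoner"])                          # frontier-model-c
print(ab_test["plugins"] == PIPELINE_CONFIG["plugins"])   # True — carried over
```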


The Three-Horse AI Cybersecurity Race

This is where it gets spicy. We now have:

  • Anthropic’s Mythos — the offensive AI that found 271 vulnerabilities in Firefox and has governments scrambling to build defenses
  • OpenAI’s Daybreak — launched the same day Google caught the first AI-built zero-day, with three GPT-5.5 variants aimed at defensive cybersecurity
  • Microsoft’s MDASH — now the top scorer on the public benchmark, proving multi-model agentic defense is production-grade

Three of the biggest AI companies on earth are now in an arms race over who can secure — or break — code faster. The Daybreak vs Mythos piece we wrote last week framed this as a two-horse sprint. Make that three horses.

For NZ, this matters because our cybersecurity gap is real. The government’s own compliance data shows most NZ organisations aren’t prepared for AI-speed threats. When the offense is AI and the defense needs to be AI too, you can’t staff your way out of it.


The DARPA Pedigree

Several members of Microsoft’s ACS team came from Team Atlanta, which won the $20 million DARPA AI Cyber Challenge by building an autonomous system that found and patched real bugs in complex open-source projects. That’s not a trivial credential — it means this team has been stress-testing AI vulnerability discovery against real-world codebases since before it was trendy.

The lesson they carried into MDASH: frontier language models need serious engineering around them to perform professional-level security auditing. The model is maybe 30% of the solution. The other 70% is the harness, the agents, the debate pipeline, and the domain-specific plugins (like Windows kernel calling conventions and IRP invariants that no LLM has in its training data).
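That “other 70%” is easiest to see in the plugin layer. A minimal sketch of what such a domain plugin interface could look like — the interface, class, and invariant strings below are all hypothetical illustrations, not MDASH internals:

```python
from typing import Protocol

class DomainPlugin(Protocol):
    """Hypothetical interface for domain knowledge the base model lacks."""
    name: str
    def annotate(self, code_path: str) -> list[str]:
        """Return domain-specific invariants the auditor should check."""
        ...

class IrpInvariants:
    # Illustrative stand-in for Windows IRP-handling rules a generic
    # LLM won't reliably know from training data alone.
    name = "irp_invariants"

    def annotate(self, code_path: str) -> list[str]:
        if code_path.endswith(".sys"):   # kernel drivers only
            return [
                "an IRP must be completed exactly once",
                "no IRP access after IoCompleteRequest returns",
            ]
        return []

hints = IrpInvariants().annotate("clfs.sys")
print(len(hints))  # 2 — two invariants injected into the auditor's prompt
```

The harness feeds these invariants into the auditor agents’ context, turning a general-purpose model into something that reasons like a Windows kernel reviewer.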


What About the Commerce Dept Deletion?

In a week where Microsoft is proving AI defense works, the US Commerce Department quietly deleted the details of an agreement where Google, Microsoft, and xAI committed to submitting new AI models for government security testing before release. The page was removed without explanation.

The juxtaposition is uncomfortable. AI defense capabilities are accelerating, but the voluntary oversight framework meant to ensure AI models are tested before deployment is being erased from public record. We covered the GUARD Act’s problems last week — this deletion reinforces the pattern.


🔍 THE BOTTOM LINE

MDASH proves that AI cybersecurity defense has crossed from research into production. Microsoft found 16 bugs that every human auditor missed, proved them with zero false positives, and beat the offensive AI leaderboard doing it. But the story isn’t “Microsoft won a benchmark” — it’s that the architecture of multi-model agentic defense is now a replicable, model-agnostic playbook. The model will keep getting replaced. The system endures.

The real question: when both offense and defense are AI-speed, what happens to the organisations — and countries — that can’t afford either?


❓ Frequently Asked Questions

Q: What does MDASH stand for? Multi-model Defense Agent System for Hardening. It’s a codename for Microsoft’s agentic vulnerability discovery and remediation system that uses 100+ specialised AI agents across multiple models to find, debate, validate, and prove security vulnerabilities.

Q: How is this different from Anthropic’s Mythos? Mythos is a single-model offensive cybersecurity tool — it finds vulnerabilities to demonstrate risk. MDASH is a multi-model defensive system — it finds vulnerabilities to fix them, and its architecture uses model disagreement as a quality signal rather than relying on one model’s judgment.

Q: What does this mean for NZ? NZ organisations already face a cybersecurity skills gap. When AI can find and exploit vulnerabilities at machine speed, the only viable defense is AI-speed detection and patching. Tools like MDASH will eventually be available commercially, but the expertise to configure and operate them remains scarce. NZ’s updated AI Blueprint should prioritise cybersecurity AI capability.

Q: Will MDASH be available to the public? Microsoft is currently running a limited private preview. Given that it found critical RCEs in Windows, broad availability will likely be phased and enterprise-first.


Sources

  • Microsoft Security Blog
  • Neowin
  • iTnews
  • Cyber Kendra
  • IT Security News