Anthropic’s Mythos announcement sent shockwaves through the cybersecurity world: a frontier model autonomously finding and exploiting thousands of zero-day vulnerabilities across every major operating system. The implicit message was clear — this level of AI security capability requires a restricted, frontier-scale model.
AISLE just proved that’s overstated.
🔬 WHAT AISLE DID
AISLE, a team that has been running an AI vulnerability discovery system against live targets since mid-2025 (15 CVEs in OpenSSL, 5 in curl, and over 180 externally validated CVEs across 30+ projects), took the specific vulnerabilities Anthropic showcased in the Mythos announcement, isolated the relevant code, and ran it through small, cheap, open-weights models.
The results challenge the narrative that only frontier models can do serious security work.
Eight out of eight models detected Mythos’s flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.
A 5.1 billion active parameter model recovered the core chain of Mythos’s 27-year-old OpenBSD bug — the most technically subtle find in the entire announcement.
📊 THE JAGGED FRONTIER
The findings reveal something AISLE calls the “jagged frontier” — AI cybersecurity capability doesn’t scale smoothly with model size, price, or generation. Rankings reshuffle completely across tasks:
| Model | FreeBSD Detection | OpenBSD SACK | OWASP False-Positive |
|---|---|---|---|
| GPT-OSS-120b (5.1B active) | ✅ | ✅ A+ | ❌ |
| GPT-OSS-20b (3.6B active) | ✅ | ❌ C | ✅ |
| Kimi K2 (open-weights) | ✅ | ✅ A- | ✅ |
| DeepSeek R1 (open-weights) | ✅ | ❌ B- | ✅ |
| Qwen3 32B | ✅ | ❌ F | ✅ |
| Gemma 4 31B | ✅ | ❌ B+ | ❌ |
There is no stable “best model for cybersecurity.” Qwen3 32B scored a perfect 9.8 CVSS assessment on the FreeBSD task and then confidently declared the OpenBSD SACK code “robust to such scenarios” — a completely wrong assessment of a genuine 27-year-old vulnerability.
💀 THE $0.11 MYTHOS KILLER
The FreeBSD NFS remote code execution vulnerability (CVE-2026-4747) is the crown jewel of the Mythos announcement — a 17-year-old bug giving unauthenticated attackers complete root access to any machine running NFS.
Every model AISLE tested found it. The smallest — GPT-OSS-20b at $0.11 per million tokens — correctly identified the stack buffer overflow, computed the remaining buffer space (96 bytes), and assessed it as critical with remote code execution potential.
When asked about exploitability, all models correctly identified that int32_t[] means no stack canary under -fstack-protector, that no KASLR means fixed gadget addresses, and that ROP is the right technique. GPT-OSS-120b produced a gadget sequence closely matching the actual Mythos exploit.
The constraint that separates Mythos from the pack is creative engineering: the full ROP chain exceeds 1000 bytes but the overflow only gives ~304 bytes of controlled data. Mythos solves this by splitting the exploit across 15 separate RPC requests, each writing 32 bytes to kernel BSS memory.
None of the small models arrived at that specific multi-round approach. But several proposed creative alternatives:
- DeepSeek R1 concluded that 304 bytes is plenty for a privilege escalation ROP chain — escalate to root, return to userland, and perform file operations there.
- Gemini Flash Lite proposed a stack-pivot approach redirecting RSP to the credential buffer for unlimited ROP chain space.
- Qwen3 32B proposed a two-stage chain-loader using copyin to copy a larger payload from userland into kernel memory.
Different solutions to the same engineering constraint. Plausible starting points for practical exploits.
🔄 INVERSE SCALING: WHEN SMALLER IS BETTER
Perhaps the most striking finding concerns false-positive discrimination — a capability fundamental to any security system that needs to operate at scale.
AISLE tested a trivial OWASP snippet: a short Java servlet that looks like textbook SQL injection but isn’t. After a list operation removes the first element, get(1) returns a constant string, not user input.
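AISLE doesn’t reproduce the snippet itself, so the following is a Python reconstruction of the control flow it describes (the original is a Java servlet; the function and variable names here are illustrative assumptions, not the actual OWASP code):

```python
# Reconstruction (assumption) of the pattern AISLE describes: a query that
# looks like textbook SQL injection, but where the interpolated value is a
# constant because a list operation has already removed the user input.

def build_query(user_input: str) -> str:
    # The list starts as: [user input, constant, constant].
    params = [user_input, "id", "'active'"]
    params.pop(0)          # remove the first element: the user input is gone
    value = params[1]      # like Java's get(1): now a constant, not user input
    # Superficially injectable string interpolation, but the caller cannot
    # influence `value`.
    return f"SELECT * FROM users WHERE status = {value}"

print(build_query("1; DROP TABLE users--"))
# The attacker-controlled string never reaches the query text.
```

A scanner that pattern-matches on string-built SQL flags this; a model that actually traces the list indices sees there is nothing to exploit.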
The results show something close to inverse scaling — small, cheap models outperform large frontier ones:
Models that got it right:
- GPT-OSS-20b (3.6B active, $0.11/M tokens)
- DeepSeek R1 (open-weights)
- OpenAI o3
Models that failed:
- Every GPT-4.1 model
- Every GPT-5.4 model (except o3 and Pro)
- Claude Sonnet 4.5, which confidently mistraced the list
- Every Anthropic model through Opus 4.5
A tool that flags everything as vulnerable is useless at scale. It drowns reviewers in noise — precisely what killed curl’s bug bounty program.
🏗️ THE MOAT IS THE SYSTEM
AISLE’s conclusion is direct: the moat in AI cybersecurity is the system, not the model.
Anthropic’s own scaffold — launch a container, prompt the model to scan files, let it hypothesize and test, use ASan as a crash oracle, rank files by attack surface, run validation — is very close to the kind of system AISLE and others have built. The value lies in the targeting, the iterative deepening, the validation, the triage, the maintainer trust.
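The loop described above can be sketched structurally. Every name below is a placeholder (an assumption for illustration), not Anthropic’s or AISLE’s actual code; the point is that the scaffold is a loop any model can be dropped into:

```python
# Structural sketch (assumption) of the scan -> hypothesize -> test ->
# validate scaffold described above. The model and oracle are injected,
# which is exactly what makes the system model-agnostic.

def attack_surface_score(f):
    # Toy heuristic: network-facing code and parsers rank first.
    return ("parse" in f["path"]) + ("net" in f["path"])

def scan_repository(files, hypothesize, write_test, crash_oracle, budget):
    findings = []
    # Rank files by attack surface so the budget goes to likely targets first.
    ranked = sorted(files, key=attack_surface_score, reverse=True)
    for f in ranked[:budget]:
        for hyp in hypothesize(f["source"]):       # model proposes candidate bugs
            poc = write_test(hyp)                  # model writes a proof-of-concept
            if crash_oracle(poc):                  # e.g. run under an ASan build
                findings.append((f["path"], hyp))  # survives the oracle -> triage
    return findings

# Toy stand-ins for the model and oracle, just to exercise the loop.
files = [{"path": "net/parse.c", "source": "memcpy(buf, in, len)"},
         {"path": "docs/readme", "source": "..."}]
hits = scan_repository(
    files,
    hypothesize=lambda src: ["overflow"] if "memcpy" in src else [],
    write_test=lambda hyp: hyp,
    crash_oracle=lambda poc: poc == "overflow",
    budget=1,
)
print(hits)  # [('net/parse.c', 'overflow')]
```

Swapping one `hypothesize` implementation for another is a one-line change; the ranking, oracle, and triage stages are where the engineering effort actually lives.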
The practical consequence of jaggedness: because small, cheap, fast models are sufficient for much of the detection work, you don’t need to deploy one expensive model and hope it looks in the right places. You can deploy cheap models broadly, scanning everything, and compensate for lower per-token intelligence with sheer coverage and lower cost-per-token.
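A back-of-envelope calculation makes the economics concrete. The token and file counts below are illustrative assumptions; only the $0.11/M price and the 100x price ratio come from the figures above:

```python
# Back-of-envelope cost (assumptions marked) of "scan everything with a
# cheap model" versus one frontier model on a guessed subset of files.

TOKENS_PER_FILE = 20_000            # assumed: source + model reasoning per file
FILES = 50_000                      # assumed: repository-wide target count

cheap_rate = 0.11 / 1_000_000       # $/token, the GPT-OSS-20b figure
frontier_rate = cheap_rate * 100    # "100x more", per the pricing ratio above

cheap_all = FILES * TOKENS_PER_FILE * cheap_rate                 # 100% coverage
frontier_5pct = FILES * 0.05 * TOKENS_PER_FILE * frontier_rate   # 5% coverage

print(f"cheap model, full coverage: ${cheap_all:,.0f}")
print(f"frontier model, 5% coverage: ${frontier_5pct:,.0f}")
```

Under these assumptions, scanning every file with the cheap model costs $110, while the frontier model burns $550 to look at one file in twenty — the coverage argument in numbers.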
A thousand adequate detectives searching everywhere will find more bugs than one brilliant detective who has to guess where to look.
🇳🇿 WHY THIS MATTERS FOR NEW ZEALAND
This research has immediate implications for how smaller organizations and countries like New Zealand participate in AI cybersecurity:
- You don’t need frontier model budgets. Open-weights models at $0.11/M tokens can detect the same zero-day vulnerabilities as models costing 100x more.
- The system is the differentiator. NZ’s cybersecurity agencies don’t need to buy access to restricted frontier models — they need to invest in the scaffolding, pipelines, and expertise that make any model effective.
- Open source levels the playing field. The most consistent performers across AISLE’s tests include open-weights models like Kimi K2 and DeepSeek R1.
- False positives are the real bottleneck. Cheap models that over-flag vulnerabilities create noise. The expertise to triage and validate is where human cybersecurity knowledge becomes irreplaceable.
🔍 THE BOTTOM LINE
Mythos validates the category, but it doesn’t settle it. AI cybersecurity is real and getting better fast. But AISLE’s evidence shows the discovery side of this capability is broadly accessible today — you don’t need a restricted, unreleased frontier model at premium pricing to find the same bugs.
Cybersecurity AI capability is jagged, not smooth. A 3.6B-active-parameter model at $0.11/M tokens found Mythos’s flagship FreeBSD exploit. A 5.1B-active-parameter model recovered the core chain of the 27-year-old OpenBSD bug. But the same models flag trivially safe code as vulnerable, or confidently declare vulnerable code “robust.” There is no single best model.
The moat is the system. The targeting, the iterative deepening, the validation, the triage, the maintainer trust — that’s where the value lives. AISLE has 180+ externally validated CVEs with maintainer acceptance using model-agnostic infrastructure. That’s the real benchmark.
For NZ and smaller economies: start building now. The models are ready. The open-weights ecosystem is competitive. What’s needed is the security expertise and engineering to turn model capabilities into trusted outcomes at scale. That’s a human investment, not a compute one.
SOURCES
- AISLE — AI Cybersecurity After Mythos: The Jagged Frontier (April 2026)
- Anthropic — Claude Mythos Preview & Project Glasswing (April 2026)
- AISLE — 15 CVEs in OpenSSL including 12/12 in a single security release