Will It Mythos? Independent Benchmark Challenges AI Model Security Supremacy Claims

Independent security research suggests that the supposedly insurmountable lead of proprietary, high-cost models in finding complex security vulnerabilities may be overstated. Several cheaper, more accessible LLMs are matching industry leaders when tested against advanced corpora like Anthropic’s Mythos.

🔍 THE BOTTOM LINE

The core takeaway: capability is democratising. Multiple open or commercially available models—including specific Chinese offerings and the Gemma family—are achieving security bug detection rates comparable to state-of-the-art proprietary systems like Opus 4.8 and GPT-5.5, but at a fraction of the operational cost. This challenges the narrative that access to restricted, premium models is a prerequisite for robust AI safety auditing—and it suggests local cyber resilience efforts in regions like New Zealand should focus on tooling and accessible model stacks rather than vendor-gated APIs.

The Performance Landscape: Parity at Scale

The benchmark, detailed by security researcher swelljoe, tested over twenty LLMs against Anthropic’s Mythos corpus. The results did not form a simple hierarchy from “best” to “worst.” Chinese models such as MiMo and DeepSeek achieved bug-finding rates comparable to Opus 4.8 and GPT-5.5 while costing significantly less to run (swelljoe’s benchmark).

The performance of open-weights models was particularly noteworthy. The Gemma 4 MoE variant found 4 of the 9 known bugs with perfect precision, meaning every bug it reported was real. That suggests strong security results are achievable outside the most restricted ecosystems. Another standout, Qwen 3.6 27B, reportedly “punches above its weight” in complex reasoning tasks (Hacker News discussion). For more on how smaller models are closing the gap, see Small AI models matching Mythos.
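The Gemma 4 MoE figures can be turned into the two standard detection metrics. This is a minimal sketch, assuming the benchmark counts a reported bug as a true positive when it matches one of the 9 known bugs; the function name and the breakdown into true positives, false positives and false negatives are illustrative, not taken from swelljoe's harness.

```python
from fractions import Fraction

def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall) as exact fractions."""
    reported = true_pos + false_pos   # everything the model flagged
    known = true_pos + false_neg      # everything there was to find
    precision = Fraction(true_pos, reported) if reported else Fraction(0)
    recall = Fraction(true_pos, known) if known else Fraction(0)
    return precision, recall

# Reported result: 4 of 9 known bugs found, no false reports (so 5 were missed).
precision, recall = precision_recall(true_pos=4, false_pos=0, false_neg=5)
print(f"precision={float(precision):.2f} recall={float(recall):.2f}")
# prints: precision=1.00 recall=0.44
```

Perfect precision therefore says nothing about coverage: the same run still missed more than half of the known bugs.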

The Access Barrier: Refusal and Restriction

A major red flag for independent researchers is the pattern of outright refusal observed in certain leading models. Google Gemini, when prompted to perform security analysis against the Mythos corpus, responded with a blanket refusal: “Sorry I cannot fulfil your request.” This behaviour, alongside similar patterns noted with Mistral, creates substantial friction for third-party auditors and academic researchers who rely on these tools for vulnerability discovery.

This raises critical questions about vendor control over safety research. If top models actively refuse to participate in independent security testing—even when the goal is improving overall AI robustness—it severely hampers the open scientific process of auditing frontier technology (Hacker News discussion). The contrast is sharp when compared with the GPT-5.5 vs Mythos duel, where willing participation produced meaningful signal.
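One practical consequence for anyone scoring such a benchmark is that a refusal should be recorded separately from a miss, since the two say different things about a model. The sketch below is an assumption about how that could be done, not the benchmark's own method: the marker phrases are a hypothetical heuristic, and apart from the quoted Gemini refusal the sample responses are invented for illustration.

```python
from collections import Counter

# Hypothetical marker phrases; a real harness would need a broader, tested list.
REFUSAL_MARKERS = (
    "i cannot fulfil",
    "i cannot fulfill",
    "i can't help",
    "i can't assist",
)

def classify(response: str, found_known_bug: bool) -> str:
    """Label one model response as 'refused', 'found', or 'missed'."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refused"
    return "found" if found_known_bug else "missed"

def tally(results):
    """Count outcomes over (response, found_known_bug) pairs."""
    return Counter(classify(resp, found) for resp, found in results)

runs = [
    ("Sorry I cannot fulfil your request.", False),         # refusal quoted in the article
    ("Possible out-of-bounds read in the parser.", True),   # invented example
    ("No issues found.", False),                            # invented example
]
summary = tally(runs)
print(dict(summary))
# prints: {'refused': 1, 'found': 1, 'missed': 1}
```

Reporting the refused count alongside the detection rate keeps a model that would not engage from being read as a model that could not find the bug.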

The Mythos Gap and Future Tooling

The benchmark confirmed that Anthropic’s Mythos identified 4 bugs that no other tested model found. But the researcher cautioned against declaring a permanent chasm in capability. The gap, while real, may be significantly narrower than perceived: improvements in tooling, rather than just access to proprietary datasets, could close it (swelljoe’s benchmark).

For cybersecurity practitioners globally, this implies a strategic shift: the focus should move from which model has the best API access to how we can build tooling that maximises the potential of accessible, high-performing models. If cheaper alternatives can perform 80–90% of the required safety analysis, the economic justification for mandatory reliance on restricted APIs diminishes considerably.
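The economic argument comes down to cost per finding rather than raw detection rate. The following is a rough sketch of that arithmetic; the function name is illustrative and every number in it is a hypothetical placeholder, not a price or result from the benchmark.

```python
def cost_per_finding(run_cost: float, bugs_found: int) -> float:
    """Cost of one audit run divided by the number of bugs it found."""
    if bugs_found <= 0:
        return float("inf")  # a run that finds nothing has no finite cost per bug
    return run_cost / bugs_found

# HYPOTHETICAL placeholder figures, chosen only to illustrate the calculation.
premium = cost_per_finding(run_cost=100.0, bugs_found=10)  # restricted, premium model
cheaper = cost_per_finding(run_cost=10.0, bugs_found=8)    # accessible model finding 80% as many
print(premium, cheaper)
# prints: 10.0 1.25
```

Under these assumed figures the cheaper model finds fewer bugs but each finding costs far less, which is the trade-off the 80–90% argument depends on. Real procurement decisions would also need to weigh the value of the bugs that only the premium model finds.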

NZ Angle

New Zealand’s cyber resilience strategy does not require access to restricted Anthropic models if cheap, open models can do 80% of the work. The “Will It Mythos?” benchmark shows that open-weights and mid-tier commercial models are already within striking distance of Opus 4.8 on security bug detection. For a small economy with limited leverage over US frontier labs, that is a strategic gift: the capability to audit critical infrastructure software does not depend on negotiating export licences for restricted APIs.

This matters acutely given the Five Eyes AI warning published today, which flags the national-security risks of adversarial AI use and urges member states to harden domestic cyber defences. NZ’s response should lean into the benchmark’s evidence—build sovereign tooling around accessible model stacks rather than assuming only the most expensive, gated systems can guarantee adequate security vetting. The Anthropic export controls discussion reinforces the point: depending on models that can be restricted by export decree is a fragile foundation for national resilience.

The practical implication for CERT NZ and private-sector defenders is straightforward. Invest in the scaffolding (prompt engineering, retrieval, agentic orchestration, and evaluation harnesses) around models you can actually run and inspect. Capability parity at a fraction of the cost changes the procurement calculus entirely.
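An evaluation harness of this kind can stay independent of any one vendor by treating a model as nothing more than a function from prompt to answer. The sketch below shows that shape under simplifying assumptions: the stub models, the test cases and the substring-based scoring are all hypothetical stand-ins, and a real harness would call a local runtime or a hosted API and use a much stricter match.

```python
from typing import Callable, Dict, List

# A model is anything that maps a prompt to a text answer.
ModelFn = Callable[[str], str]

def run_benchmark(models: Dict[str, ModelFn], cases: List[dict]) -> Dict[str, float]:
    """Return, per model, the fraction of cases whose answer mentions the expected term."""
    scores = {}
    for name, ask in models.items():
        hits = sum(
            1 for case in cases
            if case["expect"].lower() in ask(case["prompt"]).lower()
        )
        scores[name] = hits / len(cases) if cases else 0.0
    return scores

# Hypothetical stand-ins for a locally run model and a hosted model that refuses.
def local_stub(prompt: str) -> str:
    return "buffer overflow in copy loop" if "memcpy" in prompt else "no issue"

def hosted_stub(prompt: str) -> str:
    return "Sorry I cannot fulfil your request."

# Invented test cases for illustration only.
cases = [
    {"prompt": "Review: memcpy(dst, src, n) with unchecked n", "expect": "overflow"},
    {"prompt": "Review: free(p); free(p);", "expect": "double free"},
]
scores = run_benchmark({"local": local_stub, "hosted": hosted_stub}, cases)
print(scores)
# prints: {'local': 0.5, 'hosted': 0.0}
```

Because the harness only depends on the callable, swapping one model stack for another means writing a new adapter function rather than rebuilding the evaluation.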

❓ FAQ

Q1: What is the significance of Anthropic’s Mythos corpus? A: It is a proprietary collection of security prompts and scenarios designed to test the boundaries and potential vulnerabilities of large language models in complex reasoning tasks—serving as a high-bar benchmark for security analysis capability.

Q2: Are Chinese models like MiMo and DeepSeek truly matching top-tier US/EU models? A: According to the benchmark, yes. They are achieving comparable bug-finding rates to Opus 4.8 and GPT-5.5 while being significantly more cost-effective to run.

Q3: Why are Gemini and Mistral refusing security analysis prompts? A: Both exhibited blanket refusal patterns when asked to analyse the Mythos corpus—likely vendor guardrails prioritising risk mitigation over transparency. This blocks independent academic and commercial auditing efforts and creates a serious accountability gap for frontier-model safety.

Q4: How does this affect New Zealand’s cyber resilience strategy? A: It suggests NZ should prioritise building robust tooling around accessible, high-performing models rather than assuming only the most expensive, restricted APIs can guarantee adequate security vetting for critical infrastructure.

Q5: Does the Mythos gap mean Anthropic’s models are permanently ahead? A: Not necessarily. The gap is real—Mythos found 4 bugs no other model caught—but the researcher cautions it may be significantly narrower than perceived, and better tooling around accessible models could close much of it.

🔍 THE BOTTOM LINE

The “Will It Mythos?” benchmark is a market correction signal for AI safety: capability parity is emerging across diverse model ecosystems. The primary bottleneck is access and guardrails, not raw computational power or inherent model quality. For the global tech sector, this means security auditing tools must become platform-agnostic and adaptable to open-weights models to ensure true resilience against vendor lock-in.

📰 Sources

swelljoe — Will It Mythos? (Primary Benchmark Findings)
Hacker News — Will It Mythos? discussion (Community Discussion on Results)