Claude 5 Breaks AI Reasoning Ceiling with Record GPQA Diamond Score
Anthropic’s Claude 5 has achieved 87.3% on the GPQA Diamond benchmark, reportedly the first time any AI system has exceeded 85% on what is considered one of the hardest reasoning tests available. The benchmark results, dated March 3, represent an 8.1 percentage point jump over the previous record of 79.2%, equivalent to roughly four years of prior progress compressed into a single model update.

What Makes This Different
GPQA Diamond doesn’t test pattern matching. Each question requires genuine scientific reasoning in biology, chemistry, or physics, and reportedly takes PhDs 2-3 hours to answer correctly. The multiple-choice options include plausible but incorrect “distractor” answers, so the questions cannot be solved through memorization alone.
| Model | GPQA Diamond Score |
|---|---|
| Claude 5 Opus | 87.3% |
| GPT-5 | 81.1% |
| Previous record | 79.2% |
| Gemini 3 Pro | 78.4% |
| Claude 4.5 Opus | 74.8% |
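
To make the margins in the table concrete, here is a minimal Python sketch that computes how far each entry trails the top score. The numbers are copied from the table; the `gaps_to_leader` helper is a hypothetical name introduced only for this illustration.

```python
# Scores copied from the table above (percent correct on GPQA Diamond).
scores = {
    "Claude 5 Opus": 87.3,
    "GPT-5": 81.1,
    "Previous record": 79.2,
    "Gemini 3 Pro": 78.4,
    "Claude 4.5 Opus": 74.8,
}


def gaps_to_leader(table):
    """Return each entry's gap to the top score, in percentage points."""
    top = max(table.values())
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {name: round(top - value, 1) for name, value in table.items()}


gaps = gaps_to_leader(scores)
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points behind the leader")
```

The 8.1-point jump quoted in the article corresponds to the "Previous record" row.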
Extended Thinking Made the Difference
Standard Claude 5 mode scored 72.1%. Extended Thinking mode, the paid reasoning feature, jumped to 87.3%. That 15.2-point improvement came from inference-time reasoning optimization, not additional training data or larger model size.
Anthropic’s Chief Scientist commented: “This breakthrough confirms our thesis: reasoning is learnable, and scale alone was never the path forward.”
The Catch
Extended Thinking requires 40-50x more tokens than standard mode, significantly increasing costs. And with human expert agreement on GPQA at only 87.9%, just 0.6 points above Claude 5’s score, the benchmark may be approaching its ceiling.
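The trade-off can be sketched with a little arithmetic. The snippet below is a rough illustration, not Anthropic's pricing: the scores and the 40-50x multiplier come from the article, while `BASELINE_TOKENS_PER_QUESTION` and `PRICE_PER_MILLION_TOKENS_USD` are hypothetical values chosen only to make the numbers concrete.

```python
# Figures quoted in the article.
STANDARD_SCORE = 72.1            # standard mode, percent
EXTENDED_SCORE = 87.3            # Extended Thinking mode, percent
HUMAN_AGREEMENT = 87.9           # human expert agreement, percent
TOKEN_MULTIPLIER_RANGE = (40, 50)

# Hypothetical inputs, assumed purely for illustration.
BASELINE_TOKENS_PER_QUESTION = 1_000
PRICE_PER_MILLION_TOKENS_USD = 10.0


def cost_per_question(tokens, price_per_million):
    """Cost in USD of generating `tokens` tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


standard_cost = cost_per_question(
    BASELINE_TOKENS_PER_QUESTION, PRICE_PER_MILLION_TOKENS_USD
)
# Low and high cost estimates for the 40x and 50x multipliers.
extended_costs = tuple(
    cost_per_question(BASELINE_TOKENS_PER_QUESTION * m, PRICE_PER_MILLION_TOKENS_USD)
    for m in TOKEN_MULTIPLIER_RANGE
)

gain = round(EXTENDED_SCORE - STANDARD_SCORE, 1)       # points gained
headroom = round(HUMAN_AGREEMENT - EXTENDED_SCORE, 1)  # points left below human agreement

print(f"Standard cost per question: ${standard_cost:.2f}")
print(f"Extended cost per question: ${extended_costs[0]:.2f} to ${extended_costs[1]:.2f}")
print(f"Score gain: {gain} points; headroom to human agreement: {headroom} points")
```

Under these assumed inputs, the per-question cost scales linearly with the token multiplier, so the 15.2-point gain comes at 40 to 50 times the standard-mode spend.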
OpenAI quickly released benchmark results from an unreleased GPT-5.1 model claiming 85.7%, while Google committed to GPQA Diamond focus in upcoming Gemini updates.
Source: Anthropic, March 3, 2026