AI just beat two attending physicians at their own game, and its edge was biggest at the moment that matters most.
A study from Harvard Medical School and Beth Israel Deaconess Medical Center, published this week in Science, found that OpenAI's o1 model delivered the "exact or very close diagnosis" in 67% of ER triage cases, versus 55% for one attending physician and 50% for the other. The AI's advantage was most pronounced at the most critical moment: initial triage, when information is scarcest and urgency is highest.
This isn’t a lab toy scoring well on medical board exam questions. This is real patients, real electronic medical records, real emergency room decisions.
What the Study Actually Did
The researchers took 76 real patients who came into the Beth Israel emergency room and compared the diagnoses offered by two internal medicine attending physicians against those generated by OpenAI’s o1 and 4o models.
Crucially, the AI models were given the same raw electronic medical record data available at triage time — no preprocessing, no cherry-picking, no special formatting. The same messy, incomplete information a doctor sees when a patient walks through the door.
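To make "no preprocessing" concrete: in practice, a pipeline like that can be as simple as passing the raw note straight into a prompt. Here's a minimal sketch using the OpenAI Python SDK; the prompt wording, the note fields, and the exact setup are my illustrative assumptions, not the study's published protocol.

```python
# Minimal sketch: feeding a raw triage note to a reasoning model for a
# ranked differential diagnosis. The prompt and note are illustrative
# assumptions, not the study's actual protocol. Requires the `openai`
# package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Unstructured, as-captured triage data: no preprocessing, no cherry-picking.
triage_note = """
Chief complaint: chest pain x 2 hrs
Vitals: HR 112, BP 148/92, SpO2 94% RA, T 37.1
HPI: 58M, sudden substernal pressure radiating to L arm, diaphoretic.
PMH: HTN, T2DM. Meds: metformin, lisinopril. Smoker 30 pack-yrs.
"""

response = client.chat.completions.create(
    model="o1",  # the study used OpenAI's o1; exact API model name assumed
    messages=[{
        "role": "user",
        "content": (
            "You are assisting at ER triage. Given the raw note below, "
            "list the three most likely diagnoses, most likely first, "
            "with one line of reasoning each.\n\n" + triage_note
        ),
    }],
)

print(response.choices[0].message.content)
```

The point of the sketch is what's absent: no structuring, no feature extraction, no hand-holding. The model gets the same mess the doctor gets.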
Two other attending physicians then assessed the diagnoses without knowing which came from humans and which from AI. Blinded grading. Clean methodology.
The result: “At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians,” the study said.
As Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the lead authors, put it: “We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines.”
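That "nominally" deserves a pause before you read "eclipsed" as a blowout. With only 76 cases, the uncertainty around each accuracy figure is wide. A quick back-of-envelope calculation (my own arithmetic, not from the paper) shows why the careful phrasing is warranted:

```python
# Back-of-envelope 95% confidence intervals (normal approximation) for the
# reported accuracies on 76 cases. My own arithmetic, not from the paper.
import math

N = 76  # patients in the study

def ci95(p: float, n: int = N) -> tuple[float, float]:
    """Wald 95% CI for a proportion p observed on n cases."""
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

for label, p in [("o1 model", 0.67), ("physician A", 0.55), ("physician B", 0.50)]:
    lo, hi = ci95(p)
    print(f"{label}: {p:.0%}  (95% CI ~{lo:.0%} to {hi:.0%})")

# The intervals overlap substantially (o1 roughly 56-78% vs. physician A
# roughly 44-66%), consistent with "nominally better than or on par."
```

In other words: a real and consistent edge, but at this sample size, not a statistically open-and-shut one.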
Why This Matters (And Why You Should Be Careful)
Let’s be clear about what this study is and isn’t.
What it is: The most rigorous real-world comparison yet of AI versus physician diagnostic accuracy in emergency medicine. Published in Science, not some conference proceeding. This carries weight.
What it isn’t: Proof that AI should replace ER doctors. Diagnosis is one piece of the puzzle — and arguably not the hardest one. Emergency medicine involves triaging multiple patients simultaneously, reading body language, making split-second calls about resource allocation, and communicating devastating news to families. An AI that’s 67% accurate at diagnosis but 0% capable of holding a frightened patient’s hand isn’t ready to run an ER alone.
But here’s the uncomfortable question: if the AI is consistently more accurate at the diagnostic part, doesn’t the ethical burden shift toward using it rather than ignoring it?
The NZ Angle
New Zealand’s emergency departments are under serious strain. Wait times have been a political football for years. If an AI tool could improve diagnostic accuracy at triage — the bottleneck where everything slows down — the potential impact on patient outcomes here is real.
But NZ’s medical regulatory framework isn’t built for this. The Medical Council of New Zealand governs doctors. Who governs the AI that advises them? We don’t have a clear answer to that question yet, and this study makes it urgent.
The Bigger Picture
This study lands in the middle of a global conversation about AI in healthcare that’s accelerating fast. We’ve already seen AI chatbots driving vulnerable people into dangerous delusions. We’ve watched copyright battles erupt as AI companies help themselves to whatever data they want. Now we have evidence that the same class of technology that can cause harm can also, in the right context, save lives more effectively than trained physicians.
The question isn’t whether AI belongs in medicine. That ship has sailed. The question is how we deploy it responsibly — with the right guardrails, the right oversight, and the honest acknowledgment that “AI-assisted” and “AI-replaced” are very different things.
One of the study’s authors made the most important point: they’re calling for clinical testing, not clinical deployment. The next step isn’t putting o1 in every ER. It’s running controlled trials to see if AI-assisted triage actually improves patient outcomes in practice, not just on paper.
That’s the right instinct. Let’s hope the people controlling the purse strings have the same patience.