Every day, millions of Americans use artificial intelligence tools such as ChatGPT to ask medical questions. Physicians also use AI: Two in three U.S. doctors report regularly using large language models in some form, and roughly one in five consult AI for questions on patient care. Yet critical questions have remained largely unanswered: What’s the best AI for medical questions, and how badly can AI get things wrong?
Research from Stanford and Harvard evaluated AI systems on real medical cases. It found that the best models outperform board-certified doctors on key safety metrics. Yet all of the AIs tested still produced harmful recommendations at times.
New research by a team from Stanford, Harvard and several other institutions published under the fitting name Numerous Options Harm Assessment for Risk in Medicine, or NOHARM, offers the most rigorous answer yet.
A full list of NOHARM scores for each of the AI tools tested appears at the end.
How To Judge an AI on Answering Medical Questions
Historically, most evaluations of medical AI have focused on knowledge tests. For example, can the AI pass a multiple-choice medical licensing exam, naming the right diagnosis from a clean, textbook-style vignette?
But here’s the problem: Passing a medical board exam and safely managing a real patient are very different skills.
To assess how AI might perform in real clinical care, the research team built a database of 100 real physician-to-specialist consultation cases drawn from Stanford Health Care’s electronic consult systems. The cases were nuanced, real-world clinical questions that primary care doctors submitted about actual patients.
Expert physicians then broke each case into discrete clinical decision points, judging for each possible action whether it was appropriate and how much harm could result from recommending it or from failing to recommend it. Examples of clinical actions included ordering specific tests, prescribing medications or advising a patient to go to the emergency department.
Notably, the experts agreed on the appropriateness more than 95% of the time, meaning the answers reflected clinical consensus. In total, they generated 12,747 expert annotations across 4,249 clinical decision points.
What Are the Best AI Tools for Answering Medical Questions?
The research team tested 31 AI tools against the expert-adjudicated cases. The tools included major commercial AI programs and open-source systems, along with specialized medical AI platforms. Results are tracked on a public website and presented as a live leaderboard that the team intends to update as new AI models emerge.
In the first cut, the top overall performer was AMBOSS LiSA 1.0, a retrieval-augmented AI system built on a medical knowledge base. Its score was 62.3%, meaning the model’s recommendations matched the physician-labeled correct actions 62.3% of the time.
This score may appear low, but the benchmark was intentionally challenging even for strong AI models: each case contained many action-level decisions, including safety traps, and harmful recommendations were penalized.
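To make that kind of scoring concrete, here is a minimal sketch of action-level scoring with harm penalties. The action labels, penalty weight and normalization are illustrative assumptions, not the NOHARM paper’s actual formula.

```python
# Hypothetical sketch of action-level scoring with harm penalties.
# The action labels, penalty weight and normalization below are
# illustrative assumptions, not the NOHARM paper's actual formula.

def score_case(model_actions: set[str], expert_labels: dict[str, str],
               harm_penalty: float = 2.0) -> float:
    """Credit matches with expert-labeled appropriate actions;
    penalize recommendations the experts judged harmful."""
    points = 0.0
    max_points = 0.0
    for action, label in expert_labels.items():
        if label == "appropriate":
            max_points += 1.0
            if action in model_actions:
                points += 1.0          # credit: critical action recommended
        elif label == "harmful" and action in model_actions:
            points -= harm_penalty     # penalty: dangerous recommendation
    return max(points, 0.0) / max_points if max_points else 0.0

labels = {"order_troponin": "appropriate",
          "refer_to_ed": "appropriate",
          "start_opioids": "harmful"}
# One right action plus one harmful one: the penalty wipes out the credit.
print(score_case({"order_troponin", "start_opioids"}, labels))  # 0.0
```

Under a scheme like this, a model can match many correct actions and still score poorly if it also recommends something dangerous, which is why even strong models land near 60%.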
AMBOSS LiSA 1.0 was followed closely by Google’s Gemini 2.5 Pro (59.9%), Glass Health 4.0 (59.0%), OpenAI’s GPT-5 (58.3%) and Anthropic’s Claude Sonnet 4.5 (58.2%). At the bottom of the rankings sat several smaller “mini” model variants (GPT-4o mini, o1 mini, o3 mini and o4 mini), all scoring in the 42–49% range.
Importantly, the top five to six models were statistically similar, meaning the differences among the leading scores are unlikely to be practically meaningful.
The researchers also evaluated the models on several other dimensions, including safety (avoiding harmful recommendations), completeness (recommending all the critical actions a patient needs) and restraint (not recommending things that are unnecessary or equivocal).
These dimensions varied considerably across models, and in interesting ways. For example, Google’s Gemini 2.5 Pro led on safety. LiSA 1.0 achieved the highest completeness, meaning it was best at catching everything a patient needed. By contrast, OpenAI’s o3 mini scored highest on restraint but also had the lowest completeness. In effect, it was so cautious about making recommendations that it frequently missed critical interventions.
It’s Dangerous if AI Is Too Careful in Answering Clinical Questions
This observed tension between caution and completeness in AI models was one of the more important findings of the study.
The study found that AI recommendations carried the potential for severe harm in 22% of the cases. Of those instances, 77% occurred because the model failed to suggest an important action, not because it recommended something dangerous.
This creates a design problem. Developers often try to make AI safer by making it extremely cautious: adding disclaimers, limiting recommendations or defaulting to telling users to “consult a doctor.” If an AI is programmed to hold back recommendations whenever it is not 100% certain, it may withhold critical medical guidance.
In the end, this may make AI even more dangerous.
The Safety-Restraint Paradox in Medical AI: An Inverted-U Relationship
The study also uncovered a subtle but important relationship between restraint (avoiding unnecessary recommendations) and safety. It wasn’t linear—it was an inverted U.
The researchers found that safety performance peaks at intermediate levels of restraint. Too little restraint is dangerous (reckless recommendations), while too much restraint paradoxically increases harm by causing critical omissions.
The safest models, they concluded, occupy a middle ground.
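One way to picture an inverted-U is to fit a quadratic to restraint and safety scores and locate its peak. The numbers below are invented for illustration, not the study’s data.

```python
# Illustrative only: fit a quadratic to made-up (restraint, safety)
# points to show an inverted-U. These numbers are not the study's data.
import numpy as np

restraint = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # fraction of actions withheld
safety    = np.array([0.55, 0.70, 0.78, 0.68, 0.52])

a, b, c = np.polyfit(restraint, safety, 2)          # safety ~ a*r^2 + b*r + c, with a < 0
peak = -b / (2 * a)                                 # vertex of the inverted parabola
print(f"safety peaks near restraint = {peak:.2f}")  # intermediate, not extreme
```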
Where a model falls on this curve can be tuned, but the default settings vary widely across models. OpenAI’s models, for instance, consistently prioritized restraint, achieving the highest scores on that metric but lagging on completeness and safety.
How Do AI Systems Compare to Human Generalist Doctors?
The study compared the top AI models against 10 board-certified internal medicine physicians who used conventional resources such as internet search and UpToDate, but no AI assistance.
In the end, the researchers found that the best AI model outperformed the internists overall by more than 15 percentage points, and on safety by more than 10 points. This provocative finding suggests that today’s leading AI systems may already be doing better than a practicing generalist physician working without AI.
Importantly, it doesn’t mean AI will replace physicians any time soon. Human physicians still bring contextual understanding, emotional intelligence, procedural skill and accountability that no AI can replicate. But it does mean that today’s AI-assisted decision support, used thoughtfully, has the potential to reduce diagnostic and management errors that contribute to patient harm.
Medical AIs Work Better When They Check Each Other
Another important finding concerned what happened when medical AIs worked together. The researchers tested “multi-agent” configurations in which one AI (the “Advisor”) made initial recommendations and one or two additional AI models (the “Guardians”) reviewed and refined those recommendations, creating an automated second opinion.
The results: multi-agent configurations were nearly six times as likely as solo models to reach top-quartile safety performance. Three-agent setups outperformed two-agent ones.
Critically, configurations that combined models from different organizations (say, an open-source model, a proprietary frontier model and a medical knowledge system) outperformed configurations using multiple versions of the same model. Just as a tumor board brings together the expertise of a surgeon, radiologist and oncologist, the best AI teams combined different “skill sets.”
The best-performing multi-agent combination was Meta’s Llama 4 Scout (open source), Google’s Gemini 2.5 Pro (proprietary) and AMBOSS LiSA 1.0 (a medically grounded system).
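As a software pattern, the setup resembles a draft-review-revise loop. The sketch below assumes a generic chat-completion helper; the model names, prompts and the choice to have the Advisor produce the final revision are illustrative assumptions, not the study’s exact configuration.

```python
# Minimal sketch of the Advisor-Guardian pattern, assuming a generic
# chat-completion helper. Model names, prompts and the revision loop
# are placeholders for illustration, not the study's exact setup.

def ask(model: str, prompt: str) -> str:
    # Stub: in real use, call the model provider's API here.
    return f"[{model}] response to: {prompt[:40]}..."

def advisor_guardian(case: str, advisor: str, guardians: list[str]) -> str:
    # 1) The Advisor drafts initial recommendations for the case.
    draft = ask(advisor, f"Recommend next clinical actions for:\n{case}")
    for guardian in guardians:
        # 2) Each Guardian reviews the draft for harmful, unnecessary
        #    or missing actions -- the automated second opinion.
        critique = ask(guardian, f"Flag harms and omissions in:\n{draft}")
        # 3) The Advisor revises in light of each critique.
        draft = ask(advisor, f"Revise given this critique:\n{critique}\n\n{draft}")
    return draft

# A cross-vendor three-agent team, echoing the study's best combination.
print(advisor_guardian("55-year-old with new exertional chest pain",
                       advisor="llama-4-scout",
                       guardians=["gemini-2.5-pro", "amboss-lisa-1.0"]))
```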
How The Study Informs The Future of Healthcare AI
The study had many takeaways. First, not all AI is created equal when it comes to answering medical questions. The gap between the best and worst performing models was quite large: the worst models made more than three times as many severe errors as the best.
Second, correctly answering board questions is a poor proxy for real clinical performance. The systems that were most capable at answering such questions had mediocre performance in the study.
Third, the AI systems that scored highest on safety tended to be those grounded in curated medical knowledge bases, not just large general-purpose models trained on internet text.
Fourth, the relationship between caution and safety wasn’t straightforward. The safest models aren’t the most restrained or the most permissive. They occupy a middle ground.
Finally, as AI moves from documentation support to shaping real clinical decisions, we need evaluation infrastructure that keeps pace. The NOHARM leaderboard, a publicly accessible website that is open to new model submissions, is a model for what that infrastructure might look like.
Here’s The Full Ranking Of NOHARM’s Medical AIs and Their Overall Scores
Note that a total of 31 AI tools were tested in the original paper. The list below contains 33 entries because one additional model has since been added to the website and the list includes the score for human generalist physicians (#31 below).
1. AMBOSS LiSA 1.0 – 62.3%
2. Gemini 2.5 Pro – 59.9%
3. Glass Health 4.0 – 59.0%
4. GPT-5 – 58.3%
5. Gemini 2.5 Flash – 58.2%
6. Claude Sonnet 4.5 – 58.2%
7. DeepSeek R1 – 58.1%
8. Grok 4 – 58.0%
9. DeepSeek V3.1 – 57.7%
10. Claude 3.7 Sonnet – 57.6%
11. Grok 4 Fast – 57.2%
12. GPT-5 mini – 57.0%
13. GPT-4.1 – 56.4%
14. Kimi K2 – 56.1%
15. Gemini 2.0 Flash – 55.6%
16. Gemini 3 Pro – 54.8%
17. Claude Haiku 4.5 – 53.7%
18. Mistral Large 2.1 – 53.7%
19. GPT-4o – 53.6%
20. Llama 4 Maverick – 53.5%
21. o1 – 53.2%
22. Qwen3 235B – 52.7%
23. Llama 3.3 70B – 51.1%
24. GPT-5 nano – 51.1%
25. Mistral Medium 3.1 – 50.2%
26. GPT-4.1 mini – 49.7%
27. Llama 4 Scout – 49.6%
28. Qwen3 32B – 48.8%
29. o4 mini – 47.9%
30. o1 mini – 47.5%
31. Human Generalist Physicians – 46.0%
32. GPT-4o mini – 43.7%
33. o3 mini – 42.7%