When Clinical AI Meets Independent Evaluation
Real-World Testing Shows a Different Picture
An NYU study published last week compared the performance of OpenEvidence, UpToDate Expert AI, and frontier LLMs. Three things stood out to me from this paper:
Frontier models (GPT, Gemini, and Claude) outperformed specialized clinical AI tools across every evaluation.
The performance gap narrowed as the testing became more clinically realistic.
The most important contribution of this paper is the creation of a benchmark built from actual physician questions asked during routine clinical care.
One reason this paper caught my attention is that it mirrors observations from my own recent testing. In a series of evaluations covering OpenEvidence, Doximity Ask, Heidi Health, Glass Health, ChatGPT, Claude, and Gemini, I found that specialized medical AI products rarely demonstrated a clear advantage over frontier models.
5 Medical AI Models Got This Case Wrong. Is Your Favorite One of Them?
11 Medical AI Tools Read These X-rays: Everyone Missed The Pneumothorax
Let’s get into the details of the study.
As always, if you enjoy reading Ashoo Review, subscribe and tell a friend. There’s no better reference.
Sam
A new Nature Medicine paper asks a question many clinicians have been wondering over the past year: Are specialized medical AI tools actually better than frontier models?
The authors compared OpenEvidence and UpToDate Expert AI against GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 using 1,000 benchmark and 100 physician-generated clinical questions.
The frontier models won, but the more interesting story is how the margin changed across the three datasets tested.
The Study Design
The researchers used three different datasets. The first was MedQA, a collection of 500 USMLE-style multiple-choice questions designed to assess medical knowledge.
The second was HealthBench, a collection of 500 open-ended healthcare prompts scored against detailed evaluation rubrics.
The third was the study’s most interesting contribution: the Real Clinical Queries benchmark, or RCQ. For RCQ, the investigators sampled 100 de-identified physician questions from NYU Langone’s HIPAA-compliant GPT environment. The responses were then reviewed by blinded clinicians who rated correctness, completeness, safety, and clarity.
Each dataset is asking a different question.
MedQA asks whether a model can answer a medical exam question.
HealthBench asks whether a model can satisfy a detailed rubric.
RCQ asks whether a physician would find the answer useful.
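For readers curious what a blinded review like the RCQ one might look like mechanically, here is a minimal Python sketch. It is not the authors' pipeline. The function names, the neutral "Response A/B" labels, and the example ratings are all hypothetical. Only the four rating dimensions come from the study description above.

```python
import random
from statistics import mean

# The four dimensions named in the study description.
DIMENSIONS = ("correctness", "completeness", "safety", "clarity")

def blind(responses, seed=0):
    """Shuffle the systems and hide their names behind neutral labels.

    `responses` maps a system name to its answer text (hypothetical input).
    Returns the blinded answers plus the key needed to unblind later.
    """
    rng = random.Random(seed)
    systems = list(responses)
    rng.shuffle(systems)
    key = {f"Response {chr(65 + i)}": name for i, name in enumerate(systems)}
    blinded = {label: responses[name] for label, name in key.items()}
    return blinded, key

def unblind_and_average(ratings, key):
    """Map labels back to system names and average across the four dimensions."""
    return {
        key[label]: mean(scores[d] for d in DIMENSIONS)
        for label, scores in ratings.items()
    }

# Hypothetical example: two systems answering one physician question.
responses = {"System X": "answer text 1", "System Y": "answer text 2"}
blinded, key = blind(responses, seed=42)

# A reviewer sees only "Response A" / "Response B" and rates each on a 1-4 scale
# (the 1-4 range is an assumption; the article only says "four-point scale").
ratings = {
    "Response A": {"correctness": 4, "completeness": 3, "safety": 4, "clarity": 3},
    "Response B": {"correctness": 3, "completeness": 3, "safety": 3, "clarity": 3},
}
averages = unblind_and_average(ratings, key)
```

The point of the design is that the key linking labels to systems stays out of the reviewer's hands until the ratings are in.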
A Curious Pattern Emerged
The frontier models outperformed the specialized clinical tools across all three evaluations. That part is easy to summarize. What caught my attention was something else.
The performance gap became smaller as the evaluation became more clinically realistic.
On HealthBench, the separation was dramatic.
GPT-5.2 scored 88.0%.
OpenEvidence scored 62.6%.
UpToDate scored 61.3%.
Looking only at those numbers, one could conclude that the frontier models were operating in an entirely different league.
MedQA told a different story.
Gemini scored 97.4%.
OpenEvidence scored 89.6%.
UpToDate scored 88.4%.
The frontier models still led, but the gap was considerably smaller.
Then came the RCQ benchmark.
Here, the frontier models again formed the top tier, with Gemini, GPT-5.2, and Claude receiving the highest clinician ratings. But the differences were narrower.
On a four-point scale:
Gemini averaged 3.62.
GPT-5.2 averaged 3.54.
Claude averaged 3.52.
OpenEvidence averaged 3.24.
UpToDate averaged 3.17.
The frontier models' advantage was statistically significant, but all of the systems generally received favorable ratings. That's what makes the RCQ findings so interesting. If you only looked at HealthBench, you might conclude that the frontier models were vastly superior. The real-world physician evaluations tell a more nuanced story. Clinicians still preferred Gemini, GPT, and Claude, but OpenEvidence and UpToDate were generally producing acceptable answers as well.
In other words, the ranking remained the same, but the practical distance between the systems became smaller once the evaluation moved closer to actual clinical use.
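To make the arithmetic explicit, here is a small Python sketch that computes the gap between the frontier score and the best specialized score quoted above for each benchmark. The numbers are the ones reported in this post. The dictionary layout and function name are my own. Note that HealthBench and MedQA are percentages while RCQ is a rating on a four-point scale, so the three gaps are in different units. Putting them on one scale would require a normalization choice that this post does not make.

```python
# Scores as quoted in this post: (frontier score, best specialized score).
# HealthBench and MedQA are percentages; RCQ is a mean rating on a four-point scale.
scores = {
    "HealthBench": {"frontier": ("GPT-5.2", 88.0), "specialized": ("OpenEvidence", 62.6)},
    "MedQA": {"frontier": ("Gemini", 97.4), "specialized": ("OpenEvidence", 89.6)},
    "RCQ": {"frontier": ("Gemini", 3.62), "specialized": ("OpenEvidence", 3.24)},
}

def gap(entry):
    """Frontier score minus specialized score, in the benchmark's own units."""
    return round(entry["frontier"][1] - entry["specialized"][1], 2)

gaps = {name: gap(entry) for name, entry in scores.items()}

for name, value in gaps.items():
    print(f"{name}: {value}")
# HealthBench: 25.4  (percentage points)
# MedQA: 7.8         (percentage points)
# RCQ: 0.38          (rating points on a four-point scale)
```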
That matters. This paper doesn't tell us that specialized medical AI tools are failing. It tells us that they did not demonstrate a meaningful advantage over frontier models.
Are We Measuring Medicine or Benchmark Performance?
I suspect many readers will focus on who won. But the more interesting question may be why the margin changed from one benchmark to the next.
As AI systems improve, benchmark leaderboards may exaggerate differences that become less noticeable during day-to-day clinical use. A model can be significantly better at satisfying a rubric while being only modestly better when a physician evaluates the final answer.
That does not make benchmarks unimportant. It does suggest that real-world evaluation deserves more attention.
The RCQ dataset is arguably the strongest part of the paper because it moves the discussion closer to actual clinical practice.
An Unexpected Finding in the Methods
One detail that surprised me was buried in the Methods section. The authors built their real-world benchmark by sampling 100 de-identified physician questions from NYU Langone’s HIPAA-compliant GPT environment.
To do that, those interactions had to be recorded and retained somewhere. Researchers were then able to access those logs and use them to create the benchmark.
The paper doesn’t tell us exactly what was stored or for how long, but it does provide evidence that at least some health systems are monitoring and reviewing how clinicians use AI in practice.
That struck me as noteworthy. Much of the public conversation around healthcare AI focuses on model performance.
This paper quietly reveals that large health systems are beginning to accumulate enough real-world AI usage data to study clinician behavior, evaluate tools, and build institution-specific benchmarks.
That may become increasingly important as AI moves from experimentation into routine clinical workflows.
What This Means for Clinical AI
The paper raises a difficult question for the growing number of companies building clinician-focused AI products. What exactly is the advantage being offered?
For years, the assumption has been that medicine requires specialized systems trained, tuned, or wrapped specifically for healthcare. That assumption seems reasonable. Yet in this study, OpenEvidence and UpToDate Expert AI did not outperform GPT, Gemini, or Claude. OpenEvidence is particularly interesting because it has become one of the most recognizable names in clinical AI.
That doesn’t mean specialized medical AI has no value. Clinical workflows involve far more than answer generation. Citation quality, medical content licensing, governance, enterprise support, and workflow integration matter.
Those factors may ultimately prove more important than small differences in answer quality. Still, this paper suggests that specialization alone is no longer enough to assume better performance.
Looking Ahead
The authors showed that real physician questions can be collected, de-identified, reviewed by blinded clinicians, and used to compare AI systems. Medical AI needs more independent evaluation and fewer marketing claims.
The future of AI assessment will likely involve real workflows, real users, and real clinical questions rather than relying exclusively on public benchmarks.
OpenEvidence Responds
On June 14th, 2026, OpenEvidence publicly challenged the study’s conclusions and methodology on X.com.
The company’s critique focused on three areas.
Benchmark contamination. OpenEvidence argues that public datasets such as MedQA have likely been seen by modern frontier models during training, making them a poor measure of real-world performance. - I agree.
HealthBench. The company notes that HealthBench was created by OpenAI and argues that the benchmark rewards stylistic choices that may not reflect meaningful clinical quality. - Likely true.
The RCQ dataset itself. OpenEvidence points out that the physician-query dataset is not publicly available and that limited information is provided regarding question selection, reviewer selection, and dataset construction. The company also notes that the RCQ evaluation was added after peer reviewers criticized the original submission for lacking stronger real-world grounding. - This is valid, but not unusual. As soon as a medical dataset is publicly released, it becomes training fodder for frontier LLMs, so it makes sense to keep the content private.
These criticisms are worth considering. At the same time, OpenEvidence’s response highlights an interesting point of agreement.
Both sides appear to believe that benchmark performance is insufficient. The company argues that clinical AI should be evaluated using real-world clinical workflows and meaningful clinical outcomes rather than benchmark leaderboards.
The disagreement is not about whether real-world evaluation matters. It is about whether this particular real-world evaluation is convincing. Answering that question will likely require additional independent studies from other health systems.
Final Thoughts
The publication of this paper and the rapid response from OpenEvidence highlight how quickly the conversation around clinical AI is evolving.
Both perspectives contain important truths. Five years from now, few people will remember which model topped the leaderboard in this paper. The more durable contribution may be the demonstration that clinical AI can be evaluated using real physician questions and blinded clinician review. At the same time, the questions raised about benchmark contamination, transparency, and reproducibility deserve serious consideration.
Clinicians do not need another leaderboard. We need evidence. The debate this paper has already generated may prove just as valuable.


