When Clinical AI Says “I Don’t Know”
Why Human-Curated Medical Knowledge Still Matters
Earlier this week, I discussed the NYU study comparing the performance of OpenEvidence, UpToDate AI, and frontier LLMs. That article stirred up a lot of controversy and raised questions about the future of curated medical information. In this post, I’m diving deeper into those questions and suggesting that AI may actually be highlighting the need for such curated libraries.
As always, if you enjoy reading, please consider subscribing and telling a friend.
Sam
A recent NYU study comparing frontier AI models with specialized clinical AI tools raised an important question.
If large language models can answer clinical questions as well as, or better than, systems built on curated medical content, what is the future of organizations such as UpToDate, EB Medicine, and other services that summarize and interpret the medical literature?
For decades, physicians have relied on expert-written medical summaries to keep up with an impossible volume of research. These services review the literature, synthesize evidence, and provide practical clinical guidance. They have become part of the background infrastructure of modern medicine.
But the world is changing quickly. Today, a physician can ask an AI system a highly specific clinical question and receive a detailed answer within seconds. The answer can be adjusted for an emergency physician, hospitalist, primary care clinician, resident, or subspecialist. It can summarize trials, compare guidelines, explain mechanisms, and generate a practical approach.
So the question is unavoidable. Are human-curated medical knowledge systems still relevant? I think they are. But their future value may be different from their past value.
The Argument Against Curated Knowledge
The strongest argument against traditional evidence summary services is easy to understand. AI can already perform many of the tasks that physicians historically relied on these services to provide.
It can search the literature, summarize papers, compare studies, explain complex concepts, and personalize an answer. For a busy physician, this is powerful. Instead of reading a long chapter or searching through multiple articles, the clinician can ask a direct question and receive a direct response.
That is a serious challenge to the traditional medical publishing model. If summaries become instant, personalized, and inexpensive, then services built primarily around summaries will need to prove what additional value they provide.
That’s the right question, but only the first one.
Medicine Is Not Just a Search Problem
The mistake is assuming that physicians’ needs consist mainly of finding and summarizing information.
Physicians do not just need information. We need judgment. The hard questions are rarely:
What did the study show? What papers have been published?
The harder questions are often:
Does this study deserve my attention?
Was the methodology strong enough?
Does this evidence apply to my patient?
How should I weigh this study against prior evidence?
Does this change practice?
What should I do when the evidence is conflicting?
Those are not simple search tasks. They are editorial tasks. They require experience, skepticism, clinical context, and accountability.
That is what organizations like UpToDate and EB Medicine have historically provided. Their value has never been limited to summarizing articles. Their value is in deciding which evidence matters, how it should be interpreted, and how confidently it should be applied.
AI may make summaries abundant. Trustworthy interpretation remains scarce.
The Journalism Analogy
A useful analogy comes from journalism. The internet made information widely available. Search engines made that information easier to find. Social media made it easier for anyone to publish. Yet journalism did not disappear.
The best journalism continued to provide something beyond access to information. It provided verification, context, judgment, and accountability. Medicine is entering a similar phase.
AI makes medical information easier to retrieve and summarize than at any point in history. But that does not eliminate the need for trusted institutions that evaluate the quality of information. In fact, it may make them more important.
When information is scarce, access is valuable. When information is abundant, trust becomes valuable.
My Own Experience Testing Clinical AI
My own recent experiments with clinical AI have made this issue all the more real. I have tested multiple models across medical cases, ECGs, and imaging tasks. The results have often been concerning.
In several cases, AI systems provided polished, confident, and incorrect interpretations. The problem was not the writing quality. The answers were usually clear, organized, and persuasive. That is exactly what makes the errors concerning. A poorly written, wrong answer is easier to distrust. A polished wrong answer is more dangerous.
This is where curated medical knowledge systems still matter. When multiple AI systems can read the same evidence and reach different conclusions, physicians need more than another summary. They need a trusted process for deciding which interpretation deserves confidence.
The Value of Saying “I Don’t Know”
This brings me to what may be the most underappreciated issue in clinical AI.
What should an AI system do when the evidence is insufficient?
In many AI evaluations, a system that does not answer is penalized. From the perspective of a benchmark, that makes sense. Researchers need a scoring system. An unanswered question is easy to count as incorrect.
But medicine is different. A benchmark rewards answers. Clinical judgment rewards calibration.
Every physician understands that some questions do not have clean answers. The literature may be sparse. Studies may conflict. The population may not match the patient. Outcomes may be surrogate rather than patient-centered. The best available evidence may be old, biased, underpowered, or indirect. In those situations, a confident answer may be satisfying. It may also be misleading.
One of the most important functions of a trustworthy medical knowledge system is recognizing when the evidence does not support a recommendation. That can be frustrating. A clinician wants help… guidance… an answer to the question.
But an honest “we don’t know” may be more valuable than an unsupported conclusion. This is where guardrails should be seen as a feature rather than a flaw.
An AI system that declines to answer may appear less capable on a leaderboard. It may also be demonstrating a form of restraint that is essential in medicine.
The ability to say “I don’t know” is not a weakness. It’s part of trust.
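The scoring point above can be made concrete with a small sketch. It is only an illustration, not how any particular benchmark or the NYU study scores systems. The two hypothetical systems, their counts of right and wrong answers, and the `wrong_penalty` weight are all assumptions chosen to show the effect: under accuracy-only scoring, the system that always answers looks better, and once a confident wrong answer costs more than an abstention, the ranking flips.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    answer: Optional[str]  # None means the system declined to answer
    correct_answer: str


def benchmark_score(responses: List[Response]) -> float:
    """Accuracy-only scoring: an abstention counts the same as a wrong answer."""
    hits = sum(1 for r in responses if r.answer == r.correct_answer)
    return hits / len(responses)


def abstention_aware_score(responses: List[Response], wrong_penalty: float = 2.0) -> float:
    """+1 for a correct answer, 0 for declining, -wrong_penalty for a wrong answer.

    The penalty weight is an illustrative assumption, not a validated value.
    """
    total = 0.0
    for r in responses:
        if r.answer is None:
            continue  # declining neither earns nor loses points
        total += 1.0 if r.answer == r.correct_answer else -wrong_penalty
    return total / len(responses)


def make_responses(n_correct: int, n_wrong: int, n_abstain: int) -> List[Response]:
    """Build a hypothetical result set; the labels themselves are placeholders."""
    return (
        [Response("right", "right") for _ in range(n_correct)]
        + [Response("wrong", "right") for _ in range(n_wrong)]
        + [Response(None, "right") for _ in range(n_abstain)]
    )


# Hypothetical systems answering the same 10 questions (made-up counts).
always_answers = make_responses(n_correct=7, n_wrong=3, n_abstain=0)
knows_its_limits = make_responses(n_correct=6, n_wrong=0, n_abstain=4)

for name, responses in [("always answers", always_answers),
                        ("knows its limits", knows_its_limits)]:
    print(name,
          round(benchmark_score(responses), 2),
          round(abstention_aware_score(responses), 2))
```

With these made-up numbers, the system that always answers wins on plain accuracy and loses once wrong answers carry a cost. How large that cost should be is a clinical judgment, not something the code can settle.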
Guidelines Are Full of Uncertainty
This is not unique to AI. Medical guidelines frequently acknowledge uncertainty. Expert panels often conclude that evidence is insufficient. Recommendations are often graded as weak, conditional, or based on low-quality evidence. That is not a failure of guideline development. That is evidence-based medicine working properly.
A good guideline does not simply provide an answer to every question. It tells the reader how confident to be in the answer.
The same principle should apply to clinical AI. A system that always answers may feel more useful. A system that knows when not to answer may be safer.
The future of clinical AI should not be measured only by how often a system produces a response. It should also be measured by whether the system knows when a response is justified.
Where Human-Curated Libraries Still Matter
Human-curated medical libraries are systems for managing uncertainty. They don’t just collect papers. They filter, interpret, and reconcile them. They decide when evidence is strong, when it is weak, and when no recommendation can be made. That work becomes even more important when AI can generate an answer to almost anything.
A physician using AI may ask: What does the literature say?
But the deeper clinical question is often: What should I trust?
That is where expert curation still matters. The future may not be physicians reading long chapters on a website. It may be AI interfaces built on top of carefully maintained evidence bases. The interface may become conversational, personalized, and fast. But the underlying need remains the same.
Someone still has to decide what evidence is reliable.
Someone still has to decide how conflicting studies should be interpreted.
Someone still has to decide when uncertainty should be made explicit.
The Future
The future of organizations like UpToDate and EB Medicine will probably depend on how they define their own value. If they define themselves as article publishers, they will face increasing pressure. If they define themselves as trusted evidence institutions, their role may become more important.
AI can help deliver their knowledge more effectively. It can make their content easier to search, easier to personalize, and easier to apply at the bedside. But the core value is not the chatbot. The core value is the editorial process behind the chatbot. That is the part physicians should care about.
Who reviewed the evidence?
How was it selected?
How were conflicting studies handled?
What was excluded?
How often is the recommendation updated?
What level of confidence supports the answer?
When does the system refuse to answer?
Those questions matter far more than whether the interface looks modern.
Final Thoughts
AI will make medical summaries abundant. That does not make expert medical curation obsolete. It may make expert curation more important.
The future of medical knowledge will be defined by which systems can earn trust. That requires more than speed. It requires evidence appraisal, clinical judgment, transparency, accountability, and humility.
In medicine, the best answer is sometimes a confident recommendation, sometimes a cautious one, and sometimes no recommendation at all. As AI becomes more capable, physicians should pay close attention to the systems that know when to pause.
In an age when every AI can generate a summary, the real value of human-curated medical knowledge may be its ability to decide which answers deserve to exist.


