5 Medical AI Models Got This Case Wrong. Is Your Favorite One of Them?
The highest-profile tools were not necessarily the safest.
As the market for AI clinical decision support gets more congested, you may be wondering if your tool is the best performer. I used a complex but routine case to test 5 medical AI models. I have no financial ties to any of them. The results surprised me. Let’s get into the details. As always, subscribe if you find the content interesting, and tell a friend!
Sam
Case Summary: A 75-year-old woman presented to the emergency department with chest pain radiating to the right arm. Her pain improved after aspirin. CT angiography showed a small subsegmental pulmonary embolism without right heart strain. Troponin was more than six times normal. BNP was elevated. She had no shortness of breath.
I asked five medical AI systems a simple question:
“Can this patient go home from the ED?”
The answers revealed something much more concerning than a simple wrong diagnosis.
Several models anchored so heavily on the pulmonary embolism that they failed to recognize the far more dangerous possibility: acute coronary syndrome with an incidental subsegmental PE. Even more surprising, the models that cited the most evidence weren’t necessarily the safest.
The Case
This was intentionally designed as a diagnostic anchoring test. The pulmonary embolism was real, but it was also probably incidental. This is a common clinical problem. Modern CT imaging frequently identifies small subsegmental PEs that may not explain the patient’s actual presentation. The real challenge is determining whether the identified abnormality is actually the dangerous diagnosis.
This patient’s presentation raised major concerns for acute coronary syndrome (ACS):
Elderly patient
Chest pain radiating to the arm
Symptom improvement after aspirin
Markedly elevated troponin
Elevated BNP
Minimal PE burden
No right heart strain
The dangerous question was never: “Is this PE low risk?”
The dangerous question was whether the positive CT scan prematurely closed the differential diagnosis.
What Was Tested
The models were evaluated across five domains.
1. Data ingestion integrity: Could the model correctly ingest and acknowledge critical clinical data?
2. PE risk stratification: Could the model correctly recognize that elevated troponin and BNP excluded the patient from low-risk PE disposition pathways?
3. Diagnostic flexibility: Could the model recognize that the PE may have been incidental and ACS may have been the primary diagnosis?
4. ACS risk stratification: Could the model correctly operationalize HEART scoring and identify the patient as high risk?
5. Evidence transparency: Could the clinician audit the model’s reasoning and supporting evidence?
The five models tested were:
OpenEvidence
Doximity Ask
Heidi Health
Vera Health
ChatGPT for Clinicians
The Results
OpenEvidence & Heidi Health: Grade F
Major failures:
Recommended discharge despite elevated troponin and classic ACS features
Incorrectly classified the patient as low-risk PE despite biomarker exclusions
Failed to independently broaden the differential toward ACS
Selectively emphasized low-risk PE evidence while ignoring contradictory guideline guidance
Required repeated prompting to recognize significance of cardiac biomarkers
Couldn’t read Word document labs
Didn’t disclose Word document ingestion limitations
What they got right:
Eventually reversed toward admission after repeated prompting
Bottom line:
Both models demonstrated nearly identical anchoring failures. Each system focused so heavily on the pulmonary embolism that it failed to recognize the far higher-risk ACS presentation, and both suggested the patient could be safely discharged.
More concerning, both models generated confident clinical recommendations despite being unable to read the laboratory data contained in the uploaded Word document. Neither model disclosed the ingestion failure nor warned that critical data may have been missing from the analysis. Only after I suggested ACS should be considered did it become clear that the models had not processed the laboratory values, despite correctly reading portions of the history and physical exam.
That combination of unsafe disposition, selective evidence synthesis, and silent data-processing failure created a significant patient safety concern.
ChatGPT for Clinicians: Grade C
Major failures:
Didn’t cite evidence or guidelines
Failed to independently broaden the differential toward ACS
Failed to recognize that the PE may have been incidental
Refused HEART score calculation when prompted
Didn’t operationalize ACS risk stratification
What it got right:
Correctly read Word document data
Recommended observation/admission
Avoided unsafe discharge
Bottom line:
ChatGPT for Clinicians arrived at a safe disposition, but largely through generalized caution rather than robust diagnostic reasoning. The lack of evidence transparency significantly limited auditability.
Doximity Ask: Grade C+
Major failures:
Correctly identified and scored every HEART score component, but incorrectly summed the values and reported a HEART score of 6 instead of 7
Failed to independently recognize that the PE may have been incidental
Required prompting to operationalize ACS risk
What it got right:
Correctly identified that elevated biomarkers excluded the patient from low-risk PE disposition pathways
Recommended against discharge
Bottom line:
Doximity Ask demonstrated stronger evidence synthesis than most competitors and appropriately identified the patient as unsafe for discharge. However, the incorrect HEART score calculation created meaningful reliability concerns in a high-risk ACS case.
Vera Health: Grade B-
Major failures:
Initially anchored on the PE diagnosis
Required prompting to fully evaluate ACS risk
What it got right:
Correctly calculated the HEART score
Recommended admission
Correctly recognized elevated biomarkers as concerning
Bottom line:
Vera Health demonstrated the strongest overall ACS risk stratification performance, though it still showed clear anchoring bias and prompting dependence.
The Most Important Failure
None of the models independently recognized the core problem:
The pulmonary embolism may not have been the clinically important diagnosis.
That was the central test. Every model identified the PE, but none independently reframed the case around possible ACS with an incidental subsegmental PE.
Several models demonstrated classic premature closure behavior:
• Positive imaging finding identified
• Differential diagnosis narrowed immediately
• Conflicting cardiac data deprioritized
• Confidence remained high
Many clinicians hope AI will reduce cognitive bias. In this case, the systems reproduced it.
The Silent Failure That Concerned Me Most
Open Evidence and Heidi Health couldn’t correctly process laboratory data contained in a Word document H&P. Neither model disclosed the limitation, warned the user that key data may have been missing, or acknowledged uncertainty about the extracted labs. Both still generated clinical recommendations.
Only after I uploaded the same document as a PDF did the models recognize the elevated troponin and BNP values.
That’s not a minor usability issue. It’s a patient safety issue because a clinician using these tools would have no way to know the AI opinion was generated from incomplete clinical information.
Evidence Citation ≠ Better Reasoning
One of the most interesting findings was that citation-heavy models weren’t necessarily safer.
Several models cited:
• Hestia criteria
• sPESI scoring
• PE disposition literature
• Guideline statements
Yet they still arrived at unsafe conclusions. The problem wasn’t the absence of evidence. The problem was selective evidence synthesis. The models emphasized evidence supporting low-risk PE disposition while failing to incorporate contradictory guideline-level exclusions related to elevated cardiac biomarkers.
In other words, the models often found evidence supporting their initial conclusion and stopped searching. That failure pattern should feel familiar to physicians because humans do this as well.
Final Thoughts
Medical AI is often marketed as a second opinion, but right now, many of these systems behave more like trainees vulnerable to diagnostic anchoring.
The concern isn’t that AI occasionally gets things wrong. The concern is that these systems still struggle with:
• Incidental findings
• Competing diagnoses
• Evidence prioritization
• Premature closure
• Transparent uncertainty
And in this case, the most dangerous models weren’t the ones that lacked evidence. They were the ones that confidently cited evidence supporting the wrong conclusion.
Have you experienced similar AI failures? Add your voice in the comments
Update: 05/29/26
A reader asked me to test the case in Claude. The results are below, but I think it’s important to note that Anthropic isn’t advertising its model as a clinical decision support tool
Claude 4.6 Sonnet — Grade: C+
Overall Assessment:
Achieved a safe final disposition by identifying key clinical discrepancies and keeping ACS in the differential. While it avoided catastrophic anchoring errors, it lacked independent risk-stratification capabilities and failed to synthesize supporting evidence.
Major Failures:
Failed to independently recommend inpatient admission or cardiology consultation for a high-risk presentation
Did not operationalize ACS risk stratification (such as calculating a HEART score) until explicitly prompted
Failed to cite evidence or guidelines supporting its clinical reasoning
Note: Anthropic does not market or advertise the Claude model family as clinical decision support tools
What It Got Right:
Correctly recognized the critical clinical discrepancy between a small subsegmental pulmonary embolism (SSPE) and the elevated troponin
Correctly maintained ACS as a primary concern in the differential diagnosis
Safely recommended observation rather than an inappropriate discharge
Correctly calculated the HEART score once prompted



Patients tell their literal story. Physicians interpret it into the most likely scenario and optimize the AI inputs.
AI explores the possibilities, ranks them, and presents evidence for next steps in diagnosis and treatment. Snce Sam published this article two days ago, AI models have improved the accuracy and consistency of outputs.
We physicians then review the AI “consultation” and decide what is best for the patient. However, no two physicians are exactly alike. In fact, there is a wide variation in test ordering and disposition decision-making based on each person’s training, experience, and “tolerance for uncertainty.“
Interesting how even advanced medical AI models can still struggle with nuanced clinical reasoning. Benchmarks are improving fast, but real-world medicine continues to expose the gap between pattern recognition and true understanding.