11 Medical AI Tools Read These X-Rays
Every One Missed The Pneumothorax
I’ve tested AI models on multiple tasks in previous articles. In this one, I report on X-ray interpretation. Once again, it’s important to remember that I prefer services that are upfront about the limits of their models. So, refusal to interpret is a perfectly valid answer. As always, if you enjoy reading the newsletter, subscribe and tell a friend. Now let’s get into the details.
Sam
After testing AI models on a challenging clinical case and later on ECG interpretation, I wanted to see how today’s medical AI tools handled a more visual task: X-ray interpretation. I chose two images typical of what we might encounter in the emergency department and asked: “Can you read this X-ray?”
The first was a single-view chest X-ray with a left apical pneumothorax. The finding was not massive, but it was clearly present. The patient was a thin young adult with no other distracting pathology.
The second was a lateral pediatric elbow X-ray showing a supracondylar humerus fracture. The fracture disrupted the anterior humeral line and was accompanied by a visible posterior fat pad sign along with a subtle anterior fat pad sign.
I submitted the images to a mix of physician-focused AI assistants, general-purpose multimodal models, and products marketed specifically for medical image interpretation. The results surprised me, but for a different reason than in my previous tests: no system that attempted the chest X-ray identified the pneumothorax.
Tools Tested
Several of the systems in this test are not marketed as radiology products. OpenEvidence, Doximity Ask, ChatGPT for Clinicians, Claude, and Gemini are primarily positioned as medical or general-purpose AI assistants.
At first glance, that might seem like a reason to exclude them from an imaging benchmark. But once a product includes an image upload button, the distinction becomes less clear. From a user’s perspective, an upload box is an invitation. If a medical AI assistant accepts an X-ray image, analyzes it, and returns a radiology-style report, then the product is functionally participating in image interpretation, whether or not the company explicitly markets it that way.
What struck me while running these tests was how little guidance most systems provided before the upload. In many cases, there were no clear restrictions specifying which types of medical images could be submitted, which should not be submitted, or whether radiographs fell within the product's intended use. The guardrails varied considerably.
HeidiHealth refused outright and explained that medical image interpretation falls outside its intended scope. Doximity Ask displayed a warning that image interpretation is experimental and may contain mistakes. Most of the other systems accepted the images and proceeded directly into detailed radiology-style analysis.
That matters because the outputs often looked remarkably professional. Many models produced structured reports discussing the lungs, pleura, mediastinum, bones, soft tissues, and diagnostic impressions. The reports frequently resembled the format of a clinical radiology read.
The Scorecard
The Chest X-Ray Defeated Everyone
The chest radiograph contained a left apical pneumothorax. Every system that attempted an interpretation described the film as normal and explicitly stated that no pneumothorax was present.
Several models produced detailed reports that discussed mediastinal contours, hyperinflation, cardiac silhouette size, and other secondary observations. But the central finding remained unrecognized. Most surprisingly, the failure cut across every category tested: general-purpose models, medical AI assistants, and radiology-specific tools.
When multiple systems independently arrive at the same incorrect conclusion, it becomes difficult to dismiss the outcome as a one-off mistake. Instead, it points toward a shared weakness in identifying subtle thoracic imaging findings from a single radiograph.
The Elbow Created Separation
Several systems correctly identified a supracondylar fracture and recognized the abnormal anterior humeral line. Some also identified the elbow effusion and fat pad signs.
VeraHealth, ChatGPT for Clinicians, and CareCast AI formed the strongest group. Each correctly recognized the essential diagnosis and fracture pattern.
Gemini landed in the middle. It identified the fat pad signs and suspected a supracondylar fracture but simultaneously described the anterior humeral line as normal.
Claude and OpenEvidence recognized the abnormal fat pads but drifted toward an occult radial head fracture. That interpretation fits certain adult trauma scenarios but fails to explain the obvious pediatric supracondylar injury.
Doximity Ask largely treated the study as normal and failed to identify the major abnormalities.
Individual Results
1. OpenEvidence
OpenEvidence has become one of the most widely used physician-facing AI assistants and is increasingly common in clinical practice. While it is not marketed as a radiology interpretation platform, it accepts image uploads and generates a detailed radiology-style report when presented with X-rays.
Its performance was mixed. On the chest radiograph, it missed the left apical pneumothorax and described the study as normal. On the elbow film, it recognized the abnormal fat pad signs and joint effusion but failed to identify the supracondylar fracture, instead steering toward the possibility of an occult radial head injury.
2. Doximity Ask
Doximity Ask is integrated into Doximity’s physician platform and, much like OpenEvidence, includes image upload functionality. Unlike most systems tested, it provided an explicit warning that “image interpretation is experimental and may contain mistakes.”
The warning proved appropriate. The system described the chest radiograph as normal and failed to identify the pneumothorax. The elbow interpretation focused largely on limitations and caveats while missing the key findings.
3. ChatGPT for Clinicians
ChatGPT for Clinicians produced one of the strongest elbow interpretations in the group. It correctly identified a pediatric supracondylar fracture, recognized the abnormal anterior humeral line, and highlighted the associated elbow effusion.
The chest radiograph told a different story. Despite producing a polished and plausible report, the model missed the pneumothorax and concluded there was no acute cardiopulmonary abnormality.
4. Claude Sonnet 4.6
Claude generated detailed radiology-style interpretations for both studies, but the conclusions were considerably weaker than the presentation. The pneumothorax was missed, and the supracondylar fracture was never identified. Claude focused on the fat pad signs and raised the possibility of an occult radial head fracture without reaching the correct diagnosis.
5. Gemini 3.5 Flash
Gemini landed near the middle of the pack. Like every other system that attempted the chest study, it missed the pneumothorax and described the lungs as clear.
Its elbow interpretation demonstrated stronger image recognition capabilities. Gemini identified both anterior and posterior fat pad signs and correctly suspected a supracondylar fracture. However, it simultaneously stated that the anterior humeral line was normal, preventing a higher score.
6. VeraHealth
VeraHealth produced perhaps the strongest fracture interpretation in the entire test. The model identified the supracondylar fracture, recognized the displacement pattern, discussed the abnormal anterior humeral line, and correctly interpreted the posterior fat pad sign.
That strong showing was offset by a complete miss on the chest radiograph. VeraHealth explicitly stated that no pneumothorax was present and instead focused on secondary findings such as hyperinflation and airway-related changes.
7/8. HeidiHealth and Glass Health
Two systems stood apart from the rest. Rather than attempting to interpret the radiographs, HeidiHealth stated that medical image interpretation falls outside its intended capabilities. Glass Health similarly explained that it could extract text from images but could not analyze the radiographic content itself.
Many users see refusals as a weakness. In this test, those responses accurately described the products’ capabilities. Neither system claimed to see findings that it could not reliably identify.
That distinction became increasingly important as other models generated polished radiology-style reports while missing the central diagnosis.
9. CareCast AI
CareCast occupies an interesting position because medical image interpretation sits much closer to the center of its product offering than it does for most of the physician-focused assistants tested here. The system is advertised as a platform for interpreting all forms of medical imaging, which sets the highest expectation for this test.
Unfortunately, the system failed the chest radiograph, describing it as normal despite the pneumothorax. On the elbow film, however, it correctly identified a displaced distal humerus fracture and accurately characterized it as a supracondylar injury. That made it one of the strongest performers on the orthopedic study.
10. Read Your Lab
Read Your Lab is marketed directly to patients as a tool to help them understand medical results and images. Unlike many physician-focused assistants in this test, image interpretation is central to the product’s value proposition.
The system missed the pneumothorax and described the chest radiograph as normal. Because the platform allowed only a single free image analysis, I was unable to evaluate the elbow X-ray.
11. Chester (Chest AI Radiology Assistant)
Chester is an experimental chest X-ray AI developed as a research project rather than a clinical product. Importantly, the developers explicitly state that it is not intended for medical use.
The model missed the pneumothorax and reported a normal chest examination. It also declined the elbow study because the system is limited to chest imaging. Unlike the commercial products in this test, Chester’s performance should be viewed in the context of a research prototype that openly acknowledges its limitations.
The Most Interesting Comparison
The final three systems, CareCast AI, Read Your Lab, and Chester, reveal an important distinction. All three were built specifically for radiology image interpretation, and all three missed the pneumothorax.
The contrast is notable. A research prototype openly labeled as experimental performed similarly on the chest radiograph to products whose commercial positioning places greater emphasis on image interpretation.
Overall, this was the biggest disappointment for me. I fully expected some models to fail X-ray interpretation. I did not expect commercial models marketed for this purpose to also perform so poorly. In addition, the failure arrived the same way as all the others, in the form of a beautifully written formal radiology report.
Final Thoughts
The elbow radiograph demonstrated that several modern AI systems can recognize a common orthopedic injury from a single image. The chest radiograph exposed a different reality. Of the eleven services tested, nine attempted the chest film, and none identified the left apical pneumothorax.
That outcome does not mean medical image AI lacks value. It does suggest that performance remains highly dependent on the type of image, the nature of the abnormality, and the safeguards in place for deployment.
If a model can correctly diagnose a displaced pediatric supracondylar fracture while overlooking a pneumothorax on a chest X-ray, users should be cautious about assuming competence transfers from one imaging task to another.
The lesson from this test was not that the models were uniformly poor. It was that they were selectively good, and you can’t tell which tasks a tool handles well from its marketing or user instructions. If you’re using AI in clinical practice or evaluating these tools for your organization, beware the trap. They produce beautiful formal reports, but differentiating truth from fabrication requires you to know how to read your own X-rays!


