AI tools performed worse on data from outside original health system

Deep-learning models trained to detect pneumonia from chest X-rays performed worse when tested on X-rays from outside their original hospital systems, suggesting AI tools should undergo testing on a wide range of data before being used in clinical settings.

Convolutional neural networks (CNNs) designed to screen for pneumonia performed significantly better on internal than on external test data in three out of five natural comparisons, according to a recent study published in PLOS Medicine.

“The performance of CNNs in diagnosing diseases on X-rays may reflect not only their ability to identify disease-specific imaging findings on X-rays but also their ability to exploit confounding information,” the authors stated. “Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.”

With interest growing in using CNNs for computer-aided diagnosis in healthcare, a research team led by Mount Sinai Hospital in New York set out to assess how well deep-learning models trained at one hospital system generalize to external hospital systems.

The study was conducted at the Icahn School of Medicine at Mount Sinai. Researchers trained and evaluated the deep-learning model using more than 158,000 chest X-rays from three institutions: the National Institutes of Health Clinical Center, Mount Sinai Hospital and the Indiana University Network for Patient Care.
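To make the internal-versus-external comparison concrete, here is a minimal sketch, not the study's code: a logistic regression on synthetic features stands in for a CNN on chest X-rays, and the site-specific "shortcut" feature is an invented confounder used only to illustrate how performance can diverge between the training site and an outside site.

```python
# Minimal sketch of an internal-vs-external evaluation. A logistic regression
# on synthetic features stands in for a CNN on chest X-rays; all data below
# is fabricated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, confound_strength):
    """Synthetic 'hospital system': a genuine disease signal in features 0-3,
    plus a site-specific confounder in feature 15 (an assumed stand-in for,
    say, portable scanners being used more often on sicker patients)."""
    y = (rng.random(n) < 0.3).astype(int)   # ~30% pneumonia prevalence
    X = rng.normal(size=(n, 16))
    X[:, :4] += 0.5 * y[:, None]            # real pathology signal
    X[:, 15] += confound_strength * y       # site-specific shortcut
    return X, y

X_train, y_train = make_site(5000, confound_strength=2.0)  # training site
X_int, y_int = make_site(2000, confound_strength=2.0)      # internal test set
X_ext, y_ext = make_site(2000, confound_strength=0.0)      # external test set

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("internal AUC:", roc_auc_score(y_int, model.predict_proba(X_int)[:, 1]))
print("external AUC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```

Because the shortcut is present in the training and internal test data but absent at the external site, the internal AUC overstates what the model achieves externally, the pattern the study reported in three of five comparisons.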

While internal performance of the CNNs “significantly exceeded” external performance in most of the comparisons, the deep-learning models were able to “detect the hospital system where an X-ray was acquired with a high degree of accuracy, and cheated at their predictive task based on the prevalence of pneumonia at the training institution,” according to a press release from Mount Sinai.
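That "cheating" can be illustrated with a quick, equally hedged check: train a classifier to predict which hospital system an image came from. In the synthetic sketch below, the constant feature offset for one site is an assumed stand-in for real site fingerprints such as scanner hardware, preprocessing pipelines or embedded markers.

```python
# Minimal sketch of the site-detection check, on synthetic data: can a
# classifier tell which hospital system produced an image? The constant
# offset for site B is an assumption standing in for real site fingerprints.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

n = 4000
X_a = rng.normal(size=(n, 16))         # "site A" images
X_b = rng.normal(size=(n, 16)) + 0.6   # "site B" images carry a subtle offset
X = np.vstack([X_a, X_b])
site = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = site A, 1 = site B

X_tr, X_te, s_tr, s_te = train_test_split(X, site, test_size=0.25,
                                          random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
print("site-detection accuracy:", clf.score(X_te, s_te))
```

If the acquisition site is this easy to detect and pneumonia prevalence differs between sites, a disease model can raise its apparent performance by partly answering "which hospital?" instead of "pneumonia or not?", which is the shortcut the press release describes.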

Based on the results, the researchers believe AI platforms should be thoroughly assessed in a variety of real-world situations to ensure their accuracy.

“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed,” Eric Oermann, MD, senior author and neurosurgery instructor at the Icahn School of Medicine, said in a statement. “Deep learning models trained to perform medical diagnosis can generalize well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions.”

""

