Inconsistent AI: Deep learning models for breast cancer fail to deliver after closer inspection

Numerous deep learning models can detect and classify imaging findings with performance that rivals human radiologists. However, according to a new study published in the Journal of the American College of Radiology, many of these AI models aren’t nearly as impressive when applied to external data sets.

“This potential performance uncertainty raises the concern of model generalization and validation, which needs to be addressed before the models are rushed to real-world clinical practice,” wrote first author Xiaoqin Wang, MD, of the University of Kentucky in Lexington, and colleagues.

The authors explored the performance of six deep learning models for breast cancer classification: three previously published by other researchers and three they designed themselves. Five of the models, including all three designed for this study, used transfer learning, which “pretrains models on the natural image domain and transfers the models to another imaging domain later.” The sixth model used instance-based learning, which the authors describe as a widely used deep learning method for object detection with proven success in multiple image domains.
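For readers unfamiliar with the approach, the sketch below shows what transfer learning typically looks like in PyTorch. It is illustrative only: the ResNet-50 backbone, the two-class benign/malignant head and the batch shapes are assumptions for the example, not details from the study.

```python
# Minimal transfer-learning sketch (illustrative; not the study's code).
# A backbone pretrained on natural images (ImageNet) is reused, and only
# a new classification head is trained on mammography labels.
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pretrained on ImageNet, the "natural image domain."
# (On torchvision < 0.13, use pretrained=True instead of the weights enum.)
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained weights so only the new head is updated.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the 1,000-class ImageNet head with a 2-class head
# (e.g., benign vs. malignant) for the mammography domain.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)   # stand-in for preprocessed mammograms
labels = torch.randint(0, 2, (8,))     # stand-in for benign/malignant labels

optimizer.zero_grad()
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```

The appeal of the technique is that the frozen backbone supplies general visual features learned from millions of natural images, so far fewer labeled mammograms are needed to train the small task-specific head.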

The models were all trained on the Digital Database for Screening Mammography (DDSM) data set and then tested on three additional external data sets. Overall, the three previously published models achieved area under the receiver operating characteristic curve (auROC) scores ranging from 0.88 to 0.95 on images from the DDSM data set, while the three models designed for this study achieved auROC scores from 0.71 to 0.79.
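As a point of reference, auROC is computed from a model’s predicted probabilities against ground-truth labels, with 0.5 corresponding to chance-level discrimination. A minimal sketch using scikit-learn is shown below; the arrays are hypothetical stand-ins for an in-distribution DDSM test set and a shifted external set, not the study’s actual data.

```python
# Illustrative auROC computation (not the study's evaluation code).
# roc_auc_score compares predicted malignancy probabilities against
# ground-truth labels; values near 0.5 indicate chance-level performance.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical in-distribution test set (same source as the training data):
# predicted scores correlate with the true labels, so auROC is high.
ddsm_labels = rng.integers(0, 2, size=200)
ddsm_scores = np.clip(ddsm_labels * 0.6 + rng.normal(0.2, 0.2, size=200), 0, 1)

# Hypothetical external test set with a shifted data distribution:
# predictions are essentially uninformative, so auROC hovers near 0.5.
external_labels = rng.integers(0, 2, size=200)
external_scores = rng.uniform(0, 1, size=200)

print(f"DDSM auROC:     {roc_auc_score(ddsm_labels, ddsm_scores):.2f}")
print(f"External auROC: {roc_auc_score(external_labels, external_scores):.2f}")
```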

When applied to the three external data sets, however, all six AI models saw their performance decline substantially, achieving auROC scores between 0.44 and 0.65.

“Our results demonstrate that deep learning models trained on a limited data set do not perform well on data sets that have different data distributions in patient population, disease characteristics, and imaging systems,” the authors wrote. “This high variability in performance across mammography data sets and models indicates that the proclaimed high performance of deep learning models on one data set may not be readily transferred or generalized to external data sets or modern clinical data that have not been ‘seen’ by the models.”

Wang et al. concluded by pointing to the need for more consistency in the training, development and validation of AI models intended for use in healthcare.

“Guidelines and regulations are needed to catch up with the AI advancement to ensure that models with claimed high performance on limited training data undergo further assessment and validation before being applied to real-world practice,” they wrote.

Michael Walter, Managing Editor

Michael has more than 18 years of experience as a professional writer and editor. He has written at length about cardiology, radiology, artificial intelligence and other key healthcare topics.
