94% of diagnostic AI studies don’t adequately validate results
The majority of recent journal studies evaluating the performance of AI algorithms failed to adequately validate test results, according to a meta-analysis published in the Korean Journal of Radiology, meaning most of that research can only serve as proof-of-concept and might not translate into clinical performance.
Senior author Seong Ho Park, MD, PhD, and colleagues at Asan Medical Center in Seoul, South Korea, said that with the present influx of new AI technologies, researchers need to take care to validate their models in objective settings.
“As with any other medical devices or technologies, the importance of thorough clinical validation of AI algorithms before their adoption in clinical practice through adequately designed studies to ensure patient benefit and safety while avoiding any inadvertent harms cannot be overstated,” Park and co-authors wrote in KJR. “Clinical validation of AI technologies can be performed at different levels of efficacy—diagnostic performance, effects on patient outcome and societal efficacy that considers cost-benefit and cost-effectiveness.”
It’s important to validate algorithms with adequately sized, fresh datasets that weren’t used for training the technology, the authors said, so the researchers know their model can be extended to other populations and disease states. They also recommended validating algorithms with data from multiple external institutions and designing studies with a diagnostic cohort framework rather than a diagnostic case-control framework.
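As a rough illustration of what that kind of external validation looks like in practice, the sketch below trains a model on data from one hypothetical institution and then measures its discrimination on a dataset collected at a different institution that played no role in training. The file names, column names and model choice are assumptions made for demonstration, not anything drawn from the studies Park and colleagues reviewed.

```python
# Minimal sketch of external validation, using invented file and column names.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical training data from the developing institution:
# feature columns plus a binary "diagnosis" label.
train = pd.read_csv("institution_a_train.csv")
model = LogisticRegression(max_iter=1000)
model.fit(train.drop(columns="diagnosis"), train["diagnosis"])

# External validation: data collected at a different institution,
# never used for training or tuning the model.
external = pd.read_csv("institution_b_external.csv")
scores = model.predict_proba(external.drop(columns="diagnosis"))[:, 1]
print("External AUC:", roc_auc_score(external["diagnosis"], scores))
```

The point of the second dataset is simply that it is independent of model development, so the reported performance reflects how the algorithm behaves on patients it was never fit to.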
The team said it can be difficult for researchers to get their hands on the sheer volume of data required to validate complex algorithms, so they often attempt to validate their models with whatever data they have on hand. That shortcut, though, produces a biased evaluation that likely won’t reflect real-world clinical performance.
“Since the performance of an AI algorithm is strongly dependent upon its training data, there is a genuine risk that AI algorithms may not perform well in real-world practice and that an algorithm trained at one institution provides inaccurate outputs when applied to data at another institution,” Park and colleagues wrote.
The authors scrutinized 516 studies published between January and August of 2018 that investigated the performance of AI algorithms for use in radiology and diagnostics. Articles were assessed to determine whether the researchers adopted a diagnostic cohort design, included multiple institutions in their analysis and prospectively collected data for external validation. The publications spanned both medical and non-medical journals, but Park et al. didn’t find any significant differences between the two.
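Those three design features lend themselves to a simple per-study checklist. The snippet below is only an illustrative sketch of how such a tally could be recorded and counted; the study entries are invented for demonstration and are not the paper’s actual data or code.

```python
# Illustrative tally of the three design features, with invented entries.
studies = [
    {"external_validation": True,  "diagnostic_cohort": False,
     "multi_institution": False,   "prospective_data": False},
    {"external_validation": False, "diagnostic_cohort": False,
     "multi_institution": False,   "prospective_data": False},
]

externally_validated = [s for s in studies if s["external_validation"]]
meets_all_three = [
    s for s in externally_validated
    if s["diagnostic_cohort"] and s["multi_institution"] and s["prospective_data"]
]

print(f"{len(externally_validated)} of {len(studies)} performed external validation")
print(f"{len(meets_all_three)} met all three design criteria")
```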
Of all the studies evaluated, just 31, or 6 percent, performed external validation, the authors reported, and none of those 31 included all three design features deemed necessary for adequate validation. That leaves most of the studies as proof-of-concept feasibility work rather than research meant to show how AI would perform in real-world clinical settings.
The authors said that doesn’t mean the studies included in their analysis were inadequate; as work meant only to investigate technical feasibility, they’re fine. But they fall short as evaluations of an algorithm’s clinical performance.
“As AI is a rapidly evolving field with numerous new studies being published, the shelf life of our study results could be short,” Park and colleagues wrote. “Ironically, we hope to see substantial improvements in the design of studies reporting clinical performance of AI in medicine soon. Despite such rapid changes, our research remains meaningful as the baseline against which comparisons can be made to see if any improvements are made in the future, given that most published studies that were analyzed here likely predated the recent release of related methodologic guides.”