An artificial intelligence (AI) system performed very similarly to 101 radiologists in detecting breast cancer on digital mammography (DM), a large multi-center, cancer-enriched study of mammograms found.
AI was statistically non-inferior to the average of 101 radiologist readers, according to Ioannis Sechopoulos, PhD, of Radboud University Medical Centre in Nijmegen, the Netherlands, and colleagues.
At the range of low- to mid-specificity, AI had a slightly greater area under the receiver operating characteristic (ROC) curve (AUC), where the ROC curve plots the true-positive rate against the false-positive rate: 0.840 (95% CI 0.820-0.860) versus 0.814 (95% CI 0.787-0.841) for radiologists (difference 0.026, 95% CI –0.003 to 0.055), they reported in the Journal of the National Cancer Institute.
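For readers unfamiliar with the metric, an AUC can be read as the probability that a randomly chosen malignant exam receives a higher suspicion score than a randomly chosen non-malignant one, with ties counted as half. The short Python sketch below illustrates that definition only. The function name and the toy scores are hypothetical and are not drawn from the study, and the authors' actual multi-reader analysis is not described here.

```python
def auc_from_scores(scores, labels):
    """Rank-based AUC: the fraction of (malignant, non-malignant) pairs
    in which the malignant exam gets the higher score; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one non-malignant exam")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical suspicion scores on a 1-to-10 scale (illustrative only).
toy_scores = [9, 7, 4, 8, 3, 5, 2, 1]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = malignant, 0 = not malignant
toy_auc = auc_from_scores(toy_scores, toy_labels)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the gap between 0.840 and 0.814 is described as slight.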
In addition, the AI system had an AUC higher than that of 62 out of 101 (61.4%) radiologists, as well as higher sensitivity than 55 out of 95 (57.9%) radiologists. Its performance, however, was consistently lower than that of the best radiologists.
“Our results clearly show that recent advances in AI algorithms have narrowed the gap between computers and human experts in detecting breast cancer in digital mammograms,” the authors wrote.
Designed to parallel a population-based screening setting, the retrospective study drew on nine multi-reader, multi-case datasets previously used for different research purposes in Sweden, the U.K., the Netherlands, Italy, the U.S., Spain, and Austria. The mammography systems came from four vendors: Siemens, General Electric, Hologic, and Sectra.
Each dataset consisted of DM exams acquired with systems from the four vendors, multiple radiologists’ assessments per exam, and ground truth verified by histopathological analysis or follow-up. In total, there were 2,652 exams (653 malignant) and 28,296 independent interpretations among the 101 radiologists.
Each mammogram was scored for the level of suspicion that cancer was present, on a scale from 1 to 10. The performance of AI and radiologists was compared using a non-inferiority null hypothesis with a margin of 0.05. Because the data were enriched with both cancer and benign lesions, the screening recall operating point of radiologists was fixed at the mid-range in specificity, and AI achieved higher sensitivity than the majority of radiologists.
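The non-inferiority finding can be checked against the reported figures with simple arithmetic. The Python sketch below applies the conventional confidence-interval reading of a non-inferiority test, in which AI is non-inferior if the lower bound of the interval for the AUC difference (AI minus radiologists) lies above the negative margin. That decision rule and the function name are assumptions made for illustration, since the authors' exact procedure is not described here. The numbers are the ones reported above.

```python
def is_non_inferior(ci_lower, margin):
    """Conventional CI-based reading (an assumption, not necessarily the
    authors' exact test): non-inferior if the lower confidence bound of
    the difference stays above -margin."""
    return ci_lower > -margin


auc_ai = 0.840
auc_radiologists = 0.814
diff = round(auc_ai - auc_radiologists, 3)  # reported difference: 0.026
ci_lower, ci_upper = -0.003, 0.055          # reported 95% CI of the difference
margin = 0.05                               # reported non-inferiority margin

non_inferior = is_non_inferior(ci_lower, margin)  # -0.003 is above -0.05
superior = ci_lower > 0                           # the interval includes zero

print(f"difference = {diff}, non-inferior = {non_inferior}, superior = {superior}")
```

Under this reading the interval clears the margin comfortably, but because it still includes zero, the figures support non-inferiority rather than superiority.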
According to the authors, the study’s large and heterogeneous case sample suggested the findings might hold true for different lesion types, mammographic systems, and country-specific practices.
“In a population-based screening setting, the possibilities of work flow enhancement via implementation of an AI system are ample,” Sechopoulos’ group wrote. “One of the biggest potential benefits lies in the possibility of using such a system in countries that lack experienced breast radiologists, which might, for instance, impede the development, expansion, or continuation of screening programs.”
They cautioned, however, that the drawbacks of AI systems as stand-alone readers need clarification. “Regulations to define the medicolegal consequences when AI fails would have to be established,” they wrote. “Equally, trade-offs between patient outcome and cost-effectiveness have to be carefully addressed.”
In 2018, MedPage Today reported that an AI program was able to alter malignant features of DM images, highlighting the vulnerability of machine-based analysis to potential manipulation.
Stamatia Destounis, MD, of the University of Rochester in New York, called the results “extremely encouraging” for the development of AI algorithms.
“However, I would caution that this is a retrospective review with study cases from multiple collections that are enriched for cancers and also with benign false-positive findings,” she told MedPage Today.
In addition, the study was “an artificial representation of screening in a population,” said Destounis, who was not involved in the study.
She also noted that prior-year mammograms were available for only some of the exams, and this could have led to “differing interpretation comfort levels for radiologists. And the radiologists all knew these cases were enriched, which leads to bias.”
“To validate the AI algorithm, prospective screening studies would be needed, requiring very large numbers of patients, close collaboration with many centers, and considerable time intervals and expense,” Destounis continued.
Destounis pointed out that the study included U.S. radiologists, who typically have higher recall rates, and European radiologists, who use double reading, consensus, and/or arbitration to keep their recall rates down. “This is another variable that can confound the results,” she explained.
However, Destounis acknowledged that “AI could be used as an intelligent second reader, especially in locations with a scarcity of radiologists performing breast imaging interpretation, and it could help reduce the false-negative rate of mammography.”
Some co-authors disclosed employment with Siemens Healthineers and ScreenPoint Medical BV. One co-author disclosed support from Siemens Healthineers.
Sechopoulos and Destounis disclosed no relevant relationships with industry.