International multicenter validation of AI-driven ultrasound detection of ovarian cancer
Language: English; Country: United States; Medium: print-electronic
Document type: journal article, multicenter study, validation study
Grant support
231143
Radiumhemmets Forskningsfonder (Cancer Research Foundations of Radiumhemmet)
211657 Pi 01 H
Cancerfonden (Swedish Cancer Society)
2020-01702
Vetenskapsrådet (Swedish Research Council)
PubMed
39747679
PubMed Central
PMC11750711
DOI
10.1038/s41591-024-03329-4
PII: 10.1038/s41591-024-03329-4
- MeSH
- deep learning MeSH
- adult MeSH
- middle aged MeSH
- humans MeSH
- ovarian neoplasms * diagnostic imaging MeSH
- neural networks, computer * MeSH
- retrospective studies MeSH
- aged MeSH
- sensitivity and specificity MeSH
- ultrasonography * methods MeSH
- artificial intelligence MeSH
- Check Tag
- adult MeSH
- middle aged MeSH
- humans MeSH
- aged MeSH
- female MeSH
- Publication type
- journal article MeSH
- multicenter study MeSH
- validation study MeSH
Ovarian lesions are common and often incidentally detected. A critical shortage of expert ultrasound examiners has raised concerns of unnecessary interventions and delayed cancer diagnoses. Deep learning has shown promising results in the detection of ovarian cancer in ultrasound images; however, external validation is lacking. In this international multicenter retrospective study, we developed and validated transformer-based neural network models using a comprehensive dataset of 17,119 ultrasound images from 3,652 patients across 20 centers in eight countries. Using a leave-one-center-out cross-validation scheme, for each center in turn, we trained a model using data from the remaining centers. The models demonstrated robust performance across centers, ultrasound systems, histological diagnoses and patient age groups, significantly outperforming both expert and non-expert examiners on all evaluated metrics, namely F1 score, sensitivity, specificity, accuracy, Cohen's kappa, Matthews correlation coefficient, diagnostic odds ratio and Youden's J statistic. Furthermore, in a retrospective triage simulation, artificial intelligence (AI)-driven diagnostic support reduced referrals to experts by 63% while significantly surpassing the diagnostic performance of the current practice. These results show that transformer-based models exhibit strong generalization and above human expert-level diagnostic accuracy, with the potential to alleviate the shortage of expert ultrasound examiners and improve patient outcomes.
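The leave-one-center-out scheme and the eight metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `leave_one_center_out` and `binary_metrics`, the center labels and the toy counts are all hypothetical, and the formulas are the standard textbook definitions computed from a 2x2 confusion matrix (malignant taken as the positive class).

```python
from math import sqrt


def leave_one_center_out(center_ids):
    """Yield (held_out_center, train_indices, test_indices), one fold per center.

    center_ids[i] is the (hypothetical) center label of sample i.
    """
    for held_out in sorted(set(center_ids)):
        train = [i for i, c in enumerate(center_ids) if c != held_out]
        test = [i for i, c in enumerate(center_ids) if c == held_out]
        yield held_out, train, test


def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes all four counts are non-zero, so no division by zero is handled.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Diagnostic odds ratio: odds of a positive test in diseased vs non-diseased.
    dor = (tp * tn) / (fp * fn)
    youden_j = sensitivity + specificity - 1
    return {
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "cohen_kappa": kappa,
        "mcc": mcc,
        "diagnostic_odds_ratio": dor,
        "youden_j": youden_j,
    }


if __name__ == "__main__":
    # Toy example: four samples from three hypothetical centers.
    for center, train_idx, test_idx in leave_one_center_out(["A", "A", "B", "C"]):
        print(center, train_idx, test_idx)
    # Toy confusion matrix, not data from the study.
    print(binary_metrics(tp=40, fp=10, fn=10, tn=40))
```

In a real pipeline, a model would be trained on the `train` indices of each fold and evaluated on the held-out center, so every center serves once as an external test set.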
3rd Faculty of Medicine Charles University Prague Czech Republic
Department of Clinical Science and Education Södersjukhuset Karolinska Institutet Stockholm Sweden
Department of Gynecological Oncology and Gynecology Medical University of Lublin Lublin Poland
Department of Medicine and Surgery University of Milan Bicocca Milan Italy
Department of Obstetrics and Gynaecology Lithuanian University of Health Sciences Kaunas Lithuania
Department of Obstetrics and Gynecology Clínica Universidad de Navarra Pamplona Spain
Department of Obstetrics and Gynecology Rizal Medical Center Manila Philippines
Department of Obstetrics and Gynecology Skåne University Hospital Lund Sweden
Department of Obstetrics and Gynecology Södersjukhuset Stockholm Sweden
Department of Obstetrics Gynecology and Reproduction Dexeus University Hospital Barcelona Spain
Digital Futures KTH Royal Institute of Technology Stockholm Sweden
Fondazione Poliambulanza Istituto Ospedaliero Brescia Italy
Gynecologic and Obstetric Unit Women's and Children's Department Forlì Hospital Forlì Italy
Gynecology and Breast Care Center Mater Olbia Hospital Olbia Italy
Institute for Maternal and Child Health IRCCS 'Burlo Garofolo' Trieste Italy
Institute for the Care of Mother and Child Prague Czech Republic
Obstetrics and Gynecology Unit Forlì and Faenza Hospitals AUSL Romagna Forlì Italy
Science for Life Laboratory Stockholm Sweden
Unit of Preventive Gynecology European Institute of Oncology IRCCS Milan Italy
UO Gynecology Fondazione IRCCS San Gerardo dei Tintori Monza Italy