Inconsistency between Human Observation and Deep Learning Models: Assessing Validity of Postmortem Computed Tomography Diagnosis of Drowning
Language: English Country: Switzerland Medium: print-electronic
Document type: journal articles
Grant support
JP18K19892, Japan Society for the Promotion of Science
JP19H04479, Japan Society for the Promotion of Science
JP20K08012, Japan Society for the Promotion of Science
PubMed: 38336949
PubMed Central: PMC11169324
DOI: 10.1007/s10278-024-00974-6
PII: 10.1007/s10278-024-00974-6
- Keywords
- Computer-aided diagnosis, Deep learning, Drowning, Postmortem computed tomography, Validity assessment
- MeSH
- Deep Learning * MeSH
- Child MeSH
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Adolescent MeSH
- Young Adult MeSH
- Autopsy * methods MeSH
- Tomography, X-Ray Computed * methods MeSH
- Postmortem Imaging MeSH
- Reproducibility of Results MeSH
- Retrospective Studies MeSH
- ROC Curve MeSH
- Aged, 80 and over MeSH
- Aged MeSH
- Drowning * diagnosis MeSH
- Check Tag
- Child MeSH
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Adolescent MeSH
- Young Adult MeSH
- Male MeSH
- Aged, 80 and over MeSH
- Aged MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
Drowning diagnosis is a complicated process in the autopsy, even with the assistance of autopsy imaging and on-site information from where the body was found. Previous studies have developed well-performing deep learning (DL) models for drowning diagnosis. However, the validity of these DL models was not assessed, raising doubts about whether the learned features accurately represent the medical findings observed by human experts. In this paper, we assessed the medical validity of DL models that had achieved high classification performance for drowning diagnosis. This retrospective study included autopsy cases aged 8–91 years who underwent postmortem computed tomography between 2012 and 2021 (153 drowning and 160 non-drowning cases). We first trained three DL models from a previous work and generated saliency maps that highlight important features in the input. To assess the validity of the models, pixel-level annotations were created by four radiological technologists and then quantitatively compared with the saliency maps. All three models demonstrated high classification performance, with areas under the receiver operating characteristic curve of 0.94, 0.97, and 0.98, respectively. However, the assessment revealed an unexpected inconsistency between the annotations and the models' saliency maps: around 30%, 40%, and 80% of the respective models' saliency maps fell on irrelevant areas, suggesting that the predictions of the DL models might be unreliable. This result underscores the need for careful validity assessment of DL tools, even those with high classification performance.
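The saliency-versus-annotation comparison summarized above can be illustrated with a minimal sketch. Assuming each saliency map is a 2D float array (e.g., a Grad-CAM heat map) and each expert annotation is a binary mask of the same shape, an "irrelevant area" fraction could be obtained by binarizing the saliency map (here with Otsu's threshold, cited in the reference list below) and measuring how much of the salient region falls outside the annotated region. The function name, the NumPy/scikit-image dependencies, and the thresholding choice are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np
from skimage.filters import threshold_otsu


def irrelevant_area_fraction(saliency_map: np.ndarray,
                             annotation_mask: np.ndarray) -> float:
    """Fraction of the salient region lying outside the expert annotation.

    saliency_map    : 2D float array (e.g., a Grad-CAM heat map), any value range.
    annotation_mask : 2D boolean array, True where experts marked drowning-related findings.

    NOTE: illustrative sketch only; the Otsu binarization and this definition of
    "irrelevant area" are assumptions, not the published method.
    """
    # Binarize the saliency map; Otsu's method picks a threshold from its histogram.
    salient = saliency_map > threshold_otsu(saliency_map)
    if not salient.any():
        return 0.0  # no salient pixels, nothing to attribute
    # Salient pixels that do not overlap the annotated (relevant) region.
    irrelevant = salient & ~annotation_mask.astype(bool)
    return float(irrelevant.sum() / salient.sum())


if __name__ == "__main__":
    # Synthetic example: a model whose attention is mostly unrelated to the annotation.
    rng = np.random.default_rng(0)
    saliency = rng.random((128, 128))
    annotation = np.zeros((128, 128), dtype=bool)
    annotation[32:96, 32:96] = True  # hypothetical expert-marked region
    print(f"irrelevant area fraction: {irrelevant_area_fraction(saliency, annotation):.2f}")
```

A fraction near 0 would indicate that the model's salient regions lie almost entirely within the expert-marked findings, whereas values around 0.3–0.8, as reported in the abstract, indicate substantial attention to regions the experts considered irrelevant.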
Faculty of Science, University of South Bohemia in Ceske Budejovice, Ceske Budejovice, Czech Republic
Mechanical Engineering, Czech Technical University Prague, Prague, Czech Republic
National Institute of Technology, Sendai College, Sendai, Japan
See more in PubMed
Status of drowning in South-East Asia: Country reports. World Health Organization (WHO). https://www.who.int/publications/i/item/9789290210115. Accessed December 15, 2022.
Vander Plaetsen S, De Letter E, Piette M, Van Parys G, Casselman JW, Verstraete K. Post-mortem evaluation of drowning with whole body CT. Forensic science international. 2015;249:35–41. doi: 10.1016/j.forsciint.2015.01.008. PubMed DOI
Christe A, Aghayev E, Jackowski C, Thali MJ, Vock P. Drowning—post-mortem imaging findings by computed tomography. European radiology. 2008;18:283–290. doi: 10.1007/s00330-007-0745-4. PubMed DOI
Usui A, Kawasumi Y, Funayama M, Saito H. Postmortem lung features in drowning cases on computed tomography. Japanese journal of radiology. 2014;32:414–420. doi: 10.1007/s11604-014-0326-9. PubMed DOI
Homma N, Zhang X, Qureshi A, Konno T, Kawasumi Y, Usui A, Funayama M, Bukovsky I, Ichiji K, Sugita N, Yoshizawa M: A deep learning aided drowning diagnosis for forensic investigations using post-mortem lung CT images. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society, pp. 1262–1265. 10.1109/EMBC44109.2020.9175731, Jul 20, 2020. PubMed
Zeng Y, Zhang X, Kawasumi Y, Usui A, Ichiji K, Funayama M, Homma N: Deep learning-based interpretable computer-aided diagnosis of drowning for forensic radiology. In 2021 60th Annual Conference of the Society of Instrument and Control Engineers of Japan, pp. 820–824, Sep 8, 2021.
Ogawara T, Usui A, Homma N, Funayama M. Diagnosing drowning in postmortem CT images using artificial intelligence. The Tohoku Journal of Experimental Medicine. 2023;259(1):65–75. doi: 10.1620/tjem.2022.J097. PubMed DOI
Sadre R, Sundaram B, Majumdar S, Ushizima D. Validating deep learning inference during chest X-ray classification for COVID-19 screening. Scientific reports. 2021;11(1):16075. doi: 10.1038/s41598-021-95561-y. PubMed DOI PMC
Bae J, Yu S, Oh J, Kim TH, Chung JH, Byun H, Yoon MS, Ahn C, Lee DK. External validation of deep learning algorithm for detecting and visualizing femoral neck fracture including displaced and non-displaced fracture on plain X-ray. Journal of Digital Imaging. 2021;34(5):1099–1109. doi: 10.1007/s10278-021-00499-2. PubMed DOI PMC
Singh V, Danda V, Gorniak R, Flanders A, Lakhani P. Assessment of critical feeding tube malpositions on radiographs using deep learning. Journal of digital imaging. 2019;32:651–655. doi: 10.1007/s10278-019-00229-9. PubMed DOI PMC
Erten M, Tuncer I, Barua PD, Yildirim K, Dogan S, Tuncer T, Tan RS, Fujita H, Acharya UR. Automated urine cell image classification model using chaotic mixer deep feature extraction. Journal of Digital Imaging. 2023;2:1–2. PubMed PMC
Qiu S, Joshi PS, Miller MI, Xue C, Zhou X, Karjadi C, Chang GH, Joshi AS, Dwyer B, Zhu S, Kaku M. Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification. Brain. 2020;143(6):1920–1933. doi: 10.1093/brain/awaa137. PubMed DOI PMC
Liu H, Li L, Wormstone IM, Qiao C, Zhang C, Liu P, Li S, Wang H, Mou D, Pang R, Yang D. Development and validation of a deep learning system to detect glaucomatous optic neuropathy using fundus photographs. JAMA ophthalmology. 2019;137(12):1353–1360. doi: 10.1001/jamaophthalmol.2019.3501. PubMed DOI PMC
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D: Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE international conference on computer vision, pp. 618–626. 10.48550/arXiv.1610.02391, 2017.
Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. Proceedings of Computer Vision–ECCV, pp. 818–833, September 6–12, 2014.
Singh A, Sengupta S, Lakshminarayanan V. Explainable deep learning models in medical image analysis. Journal of Imaging. 2020;6(6):52. doi: 10.3390/jimaging6060052. PubMed DOI PMC
Zeng Y, Zhang X, Kawasumi Y, Usui A, Ichiji K, Funayama M, Homma N: A 2.5D deep learning-based method for drowning diagnosis using post-mortem computed tomography. IEEE Journal of Biomedical and Health Informatics 27(2):1026–1035, 2023. PubMed
Arun N, Gaw N, Singh P, Chang K, Aggarwal M, Chen B, Hoebel K, Gupta S, Patel J, Gidwani M, Adebayo J: Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiology: Artificial Intelligence 3(6): e200267, 2021. PubMed PMC
Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Communications of the ACM. 2017;60(6):84–90. doi: 10.1145/3065386. DOI
Simonyan K, Zisserman A: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, Sep 4, 2014.
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA: Inception-v4, Inception-ResNet and the impact of residual connections on learning. Thirty-First AAAI Conference on Artificial Intelligence, pp. 4278–4284, 2017.
Ribeiro MT, Singh S, Guestrin C: "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144, 2016.
Lundberg SM, Lee SI: A unified approach to interpreting model predictions. Advances in neural information processing systems (NIPS) 30, 2017.
Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M: Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, Dec 21, 2014
Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN: Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847, 2018.
Reyes M, Meier R, Pereira S, Silva CA, Dahlweid FM, Tengg-Kobligk HV, Summers RM, Wiest R: On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiology: Artificial Intelligence 2(3):e190043, 2020. PubMed PMC
Wang H, Wang Z, Du M, Yang F, Zhang Z, Ding S, Mardziel P, Hu X: Score-CAM: Score-weighted visual explanations for convolutional neural networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 24–33, 2020.
Wada K. Labelme: Image Polygonal Annotation with Python. https://github.com/wkentaro/labelme
Armato SG, III, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, Zhao B, Aberle DR, Henschke CI, Hoffman EA, Kazerooni EA. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical physics. 2011;38(2):915–931. doi: 10.1118/1.3528204. PubMed DOI PMC
Boggust A, Hoover B, Satyanarayan A, Strobelt H: Shared interest: Measuring human-AI alignment to identify recurring patterns in model behavior. Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17, 2022.
Hoiem D, Chodpathumwan Y, Dai Q: Diagnosing error in object detectors. In European conference on computer vision, pp. 340–353, Oct 7, 2012.
Redmon J, Divvala S, Girshick R, Farhadi A: You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. 2016.
Otsu N. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics. 1979;9(1):62–66. doi: 10.1109/TSMC.1979.4310076. DOI
Hausman NL, Javed N, Bednar MK, Guell M, Schaller E, Nevill RE, Kahng S. Interobserver consistency: A preliminary investigation into how much is enough? Journal of applied behavior analysis. 2022;55(2):357–368. doi: 10.1002/jaba.811. PubMed DOI
Amgad M, Atteya LA, Hussein H, Mohammed KH, Hafiz E, Elsebaie MA, Alhusseiny AM, AlMoslemany MA, Elmatboly AM, Pappalardo PA, Sakr RA. NuCLS: A scalable crowdsourcing approach and dataset for nucleus classification and segmentation in breast cancer. Giga Science. 2022;11:1–12. doi: 10.1093/gigascience/giac037. PubMed DOI PMC