Understanding metric-related pitfalls in image analysis validation
Language: English; Country: United States; Medium: print-electronic
Document type: journal articles, reviews
Grant support
213038
Wellcome Trust - United Kingdom
UH3 CA225021
NCI NIH HHS - United States
203148
Wellcome Trust - United Kingdom
U01 CA242871
NCI NIH HHS - United States
U24 CA279629
NCI NIH HHS - United States
R01 NS042645
NINDS NIH HHS - United States
P41 GM135019
NIGMS NIH HHS - United States
Wellcome Trust - United Kingdom
U24 CA215109
NCI NIH HHS - United States
EP-W-17-011
EPA - United States
CEP - Central Register of Projects
U24 CA180924
NCI NIH HHS - United States
PubMed
38347140
PubMed Central
PMC11181963
DOI
10.1038/s41592-023-02150-0
PII: 10.1038/s41592-023-02150-0
- MeSH
- artificial intelligence * MeSH
- Publication type
- journal articles MeSH
- reviews MeSH
Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.
Allen Institute for Cell Science Seattle WA USA
ARTORG Center for Biomedical Engineering Research University of Bern Bern Switzerland
Cell Biology and Biophysics Unit European Molecular Biology Laboratory Heidelberg Germany
Center for Biomedical Image Computing and Analytics University of Pennsylvania Philadelphia PA USA
Centre for Intelligent Machines and MILA McGill University Montréal Quebec Canada
Centre for Medical Image Computing University College London London UK
Department AIBE Friedrich Alexander Universität Erlangen Nürnberg Germany
Department of Biomedical Data Sciences Leiden University Medical Center Leiden the Netherlands
Department of Biomedical Informatics Stony Brook University Health Science Center Stony Brook NY USA
Department of Computer Science IT University of Copenhagen Copenhagen Denmark
Department of Computer Science University of Toronto Toronto Ontario Canada
Department of Computing Faculty of Engineering Imperial College London London UK
Department of Computing Imperial College London South Kensington Campus London UK
Department of Development and Regeneration and EPI centre KU Leuven Leuven Belgium
Department of Digital Medical Technologies Holon Institute of Technology Holon Israel
Department of Medical Biophysics University of Toronto Toronto Ontario Canada
Department of Pathology Radboud University Medical Center Nijmegen the Netherlands
Department of Quantitative Biomedicine University of Zurich Zurich Switzerland
Department of Radiation Oncology University Hospital Bern University of Bern Bern Switzerland
Department of Surgery Perelman School of Medicine Philadelphia PA USA
Department of Surgery University Health Network Philadelphia PA USA
Electrical Engineering Vanderbilt University Nashville TN USA
European Federation for Medical Informatics Le Mont sur Lausanne Switzerland
Faculty of Mathematics and Computer Science Heidelberg University Heidelberg Germany
Faculty of Medicine Heidelberg University Hospital Heidelberg Germany
Frankfurt Cancer Institute Frankfurt am Main Germany
Fraunhofer MEVIS Bremen Germany
German Cancer Research Center Heidelberg Division of Biostatistics Heidelberg Germany
German Cancer Research Center Heidelberg Division of Intelligent Medical Systems Heidelberg Germany
German Cancer Research Center Heidelberg Division of Medical Image Computing Heidelberg Germany
German Cancer Research Center Heidelberg Heidelberg Germany
German Cancer Research Center Heidelberg HI Applied Computer Vision Lab Heidelberg Germany
German Cancer Research Center Heidelberg HI Helmholtz Imaging Heidelberg Germany
German Cancer Research Center Heidelberg Interactive Machine Learning Group Heidelberg Germany
Goethe University Frankfurt Department of Informatics Frankfurt am Main Germany
Goethe University Frankfurt Department of Medicine Frankfurt am Main Germany
Google Health Google Palo Alto CA USA
Helmholtz AI Oberschleißheim Germany
IHU Strasbourg Strasbourg France
Imaging Platform Broad Institute of MIT and Harvard Cambridge MA USA
Informatics Institute Faculty of Science University of Amsterdam Amsterdam the Netherlands
Information Systems Institute University of Applied Sciences Western Switzerland Sierre Switzerland
Institute for Computational Biomedicine Heidelberg University Heidelberg Germany
Institute of Information Systems Engineering TU Wien Vienna Austria
Instituto de Cálculo CONICET Universidad de Buenos Aires Buenos Aires Argentina
Laboratoire Traitement du Signal et de l'Image UMR_S 1099 Université de Rennes 1 Rennes France
Leibniz Institut für Analytische Wissenschaften ISAS e.V. Dortmund Germany
Medical Faculty University of Geneva Geneva Switzerland
National Institute of Allergy and Infectious Diseases Bethesda MD USA
National Institutes of Health Clinical Center Bethesda MD USA
Neurocenter Oulu Oulu University Hospital Oulu Finland
Parietal project team INRIA Saclay Île de France Palaiseau France
Physical Sciences Sunnybrook Research Institute Toronto Ontario Canada
Princess Margaret Cancer Centre University Health Network Toronto Ontario Canada
Radboud Institute for Health Sciences Radboud University Medical Center Nijmegen the Netherlands
Research Unit of Health Sciences and Technology Faculty of Medicine University of Oulu Oulu Finland
School of Biomedical Engineering and Imaging Science King's College London London UK
School of Engineering The University of Edinburgh Edinburgh Scotland
Simula Metropolitan Center for Digital Engineering Oslo Norway
Tissue Image Analytics Laboratory Department of Computer Science University of Warwick Coventry UK
Translational Image guided Oncology University Medicine Essen Essen Germany
UiT The Arctic University of Norway Tromsø Norway
Universitat Pompeu Fabra Barcelona Spain
University of Adelaide Adelaide South Australia Australia
University of Potsdam Digital Engineering Faculty Potsdam Germany
Vector Institute for Artificial Intelligence Toronto Ontario Canada
Bilic Patrick, Christ Patrick, Li Hongwei Bran, Vorontsov Eugene, Ben-Cohen Avi, Kaissis Georgios, Szeskin Adi, Jacobs Colin, Mamani Gabriel Efrain Humpire, Chartrand Gabriel, et al. The liver tumor segmentation benchmark (lits). Medical Image Analysis, 84:102680, 2023. PubMed PMC
Brown Bernice B. Delphi process: a methodology used for the elicitation of opinions of experts. Technical report, Rand Corp Santa Monica CA, 1968.
Carbonell Alberto, De la Pena Marcos, Flores Ricardo, and Gago Selma. Effects of the trinucleotide preceding the self-cleavage site on eggplant latent viroid hammerheads: differences in co- and post-transcriptional self-cleavage may explain the lack of trinucleotide auc in most natural hammerheads. Nucleic acids research, 34(19):5613–5622, 2006. PubMed PMC
Chen Jianxu, Ding Liya, Viana Matheus P, Lee HyeonWoo, Sluzewski M Filip, Morris Benjamin, Hendershott Melissa C, Yang Ruian, Mueller Irina A, and Rafelski Susanne M. The allen cell and structure segmenter: a new open source toolkit for segmenting 3d intracellular structures in fluorescence microscopy images. BioRxiv, page 491035, 2020.
Chicco Davide and Jurman Giuseppe. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics, 21(1):1–13, 2020. PubMed PMC
Chicco Davide, Tötsch Niklas, and Jurman Giuseppe. The matthews correlation coefficient (mcc) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData mining, 14(1):1–22, 2021. The manuscript addresses the challenge of evaluating binary classifications. It compares MCC to other metrics, explaining their mathematical relationships and providing use cases where MCC offers more informative results. PubMed PMC
Cordts Marius, Omran Mohamed, Ramos Sebastian, Scharwächter Timo, Enzweiler Markus, Benenson Rodrigo, Franke Uwe, Roth Stefan, and Schiele Bernt. The cityscapes dataset. In CVPR Workshop on The Future of Datasets in Vision, 2015.
Correia Paulo and Pereira Fernando. Video object relevance metrics for overall segmentation quality evaluation. EURASIP Journal on Advances in Signal Processing, 2006:1–11, 2006.
Sabatino Antonio Di and Corazza Gino Roberto. Nonceliac gluten sensitivity: sense or sensibility?, 2012. PubMed
Everingham Mark, Van Gool Luc, Williams Christopher KI, Winn John, and Zisserman Andrew. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
Gooding Mark J, Smith Annamarie J, Tariq Maira, Aljabar Paul, Peressutti Devis, van der Stoep Judith, Reymen Bart, Emans Daisy, Hattu Djoya, van Loon Judith, et al. Comparative evaluation of autocontouring in clinical practice: a practical method using the turing test. Medical physics, 45(11):5105–5115, 2018. PubMed
Gooding Mark J, Boukerroui Djamal, Osorio Eliana Vasquez, Monshouwer René, and Brunenberg Ellen. Multicenter comparison of measures for quantitative evaluation of contouring in radiotherapy. Physics and Imaging in Radiation Oncology, 24:152–158, 2022. PubMed PMC
Grandini Margherita, Bagli Enrico, and Visani Giorgio. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756, 2020.
Gruber Sebastian and Buettner Florian. Trustworthy deep learning via proper calibration errors: A unifying approach for quantifying the reliability of predictive uncertainty. arXiv preprint arXiv:2203.07835, 2022.
Honauer Katrin, Maier-Hein Lena, and Kondermann Daniel. The hci stereo metrics: Geometry-aware performance analysis of stereo algorithms. In Proceedings of the IEEE International Conference on Computer Vision, pages 2120–2128, 2015.
Kaggle. Sartorius Cell Instance Segmentation 2021. https://www.kaggle.com/c/sartorius-cell-instance-segmentation, 2021. [Online; accessed 25-April-2022].
Kofler Florian, Ezhov Ivan, Isensee Fabian, Berger Christoph, Korner Maximilian, Paetzold Johannes, Li Hongwei, Shit Suprosanna, McKinley Richard, Bakas Spyridon, et al. Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient. arXiv preprint arXiv:2103.06205v1, 2021.
Konukoglu Ender, Glocker Ben, Ye Dong Hye, Criminisi Antonio, and Pohl Kilian M. Discriminative segmentation-based evaluation through shape dissimilarity. IEEE transactions on medical imaging, 31(12):2278–2289, 2012. PubMed PMC
Lennerz Jochen K, Green Ursula, Williamson Drew FK, and Mahmood Faisal. A unifying force for the realization of medical ai. npj Digital Medicine, 5(1):1–3, 2022. PubMed PMC
Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C Lawrence. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
Maier-Hein Lena, Eisenmann Matthias, Reinke Annika, Onogur Sinan, Stankovic Marko, Scholz Patrick, Arbel Tal, Bogunovic Hrvoje, Bradley Andrew P, Carass Aaron, et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature communications, 9(1):1–13, 2018. With this comprehensive analysis of biomedical image analysis competitions (challenges), the authors initiated a shift in how such challenges are designed, performed, and reported in the biomedical domain. Its concepts and guidelines have been adopted by reputed organizations such as MICCAI. PubMed PMC
Maier-Hein Lena, Reinke Annika, Christodoulou Evangelia, Glocker Ben, Godau Patrick, Isensee Fabian, Kleesiek Jens, Kozubek Michal, Reyes Mauricio, Riegler Michael A, et al. Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
Margolin Ran, Zelnik-Manor Lihi, and Tal Ayellet. How to evaluate foreground maps? In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 248–255, 2014.
Muschelli John. Roc and auc with a binary predictor: a potentially misleading metric. Journal of classification, 37(3): 696–708, 2020. PubMed PMC
Nasa Prashant, Jain Ravi, and Juneja Deven. Delphi methodology in healthcare research: how to decide its appropriateness. World Journal of Methodology, 11(4):116, 2021. PubMed PMC
Ounkomol Chawin, Seshamani Sharmishtaa, Maleckar Mary M, Collman Forrest, and Johnson Gregory R. Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. Nature methods, 15(11): 917–920, 2018. PubMed PMC
Reinke Annika, Eisenmann Matthias, Tizabi Minu D, Sudre Carole H, Rädsch Tim, Antonelli Michela, Arbel Tal, Bakas Spyridon, Cardoso M Jorge, Cheplygina Veronika, Farahani Keyvan, Glocker Ben, Heckmann-Nötzel Doreen, Isensee Fabian, Jannin Pierre, Kahn Charles, Kleesiek Jens, Kurc Tahsin, Kozubek Michal, Landman Bennett A, Litjens Geert, Maier-Hein Klaus, Martel Anne L, Müller Henning, Petersen Jens, Reyes Mauricio, Rieke Nicola, Stieltjes Bram, Summers Ronald M, Tsaftaris Sotirios A, van Ginneken Bram, Kopp-Schneider Annette, Jäger Paul, and Maier-Hein Lena. Common limitations of image processing metrics: A picture story. arXiv preprint arXiv:2104.05642, 2021.
Roberts Brock, Haupt Amanda, Tucker Andrew, Grancharova Tanya, Arakaki Joy, Fuqua Margaret A, Nelson Angelique, Hookway Caroline, Ludmann Susan A, Mueller Irina A, et al. Systematic gene tagging using crispr/cas9 in human stem cells to illuminate cell organization. Molecular biology of the cell, 28(21):2854–2874, 2017. PubMed PMC
Schmidt Uwe, Weigert Martin, Broaddus Coleman, and Myers Gene. Cell detection with star-convex polygons. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 265–273. Springer, 2018.
Stringer Carsen, Wang Tim, Michaelos Michalis, and Pachitariu Marius. Cellpose: a generalist algorithm for cellular segmentation. Nature methods, 18(1):100–106, 2021. PubMed
Taha Abdel Aziz and Hanbury Allan. Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool. BMC medical imaging, 15(1):1–28, 2015. The paper discusses the importance of effective metrics for evaluating the accuracy of 3D medical image segmentation algorithms. The authors analyze existing metrics, propose a selection methodology, and develop a tool to aid researchers in choosing appropriate evaluation metrics based on the specific characteristics of the segmentation task. PubMed PMC
Taha Abdel Aziz, Hanbury Allan, and Jimenez del Toro Oscar A. A formal method for selecting evaluation metrics for image segmentation. In 2014 IEEE international conference on image processing (ICIP), pages 932–936. IEEE, 2014.
Tran Thuy Nuong, Adler Tim, Yamlahi Amine, Christodoulou Evangelia, Godau Patrick, Reinke Annika, Tizabi Minu Dietlinde, Sauer Peter, Persicke Tillmann, Albert Jörg Gerhard, et al. Sources of performance variability in deep learning-based polyp detection. arXiv preprint arXiv:2211.09708, 2022. PubMed PMC
Vaassen Femke, Hazelaar Colien, Vaniqui Ana, Gooding Mark, van der Heyden Brent, Canters Richard, and van Elmpt Wouter. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Physics and Imaging in Radiation Oncology, 13:1–6, 2020. PubMed PMC
Viana Matheus P, Chen Jianxu, Knijnenburg Theo A, Vasan Ritvik, Yan Calysta, Arakaki Joy E, Bailey Matte, Berry Ben, Borensztejn Antoine, Brown Eva M, et al. Integrated intracellular organization and its variations in human ips cells. Nature, pages 1–10, 2023. PubMed PMC
Wiesenfarth Manuel, Reinke Annika, Landman Bennett A, Eisenmann Matthias, Saiz Laura Aguilera, Cardoso M Jorge, Maier-Hein Lena, and Kopp-Schneider Annette. Methods and open-source toolkit for analyzing and visualizing challenge results. Scientific Reports, 11(1):1–15, 2021. PubMed PMC
Yeghiazaryan Varduhi and Voiculescu Irina D. Family of boundary overlap metrics for the evaluation of medical image segmentation. Journal of Medical Imaging, 5(1):015006, 2018. PubMed PMC
Hirling Dominik, Tasnadi Ervin, Caicedo Juan, Caroprese Maria V, Sjögren Rickard, Aubreville Marc, Koos Krisztian, and Horvath Peter. Segmentation metric misinterpretations in bioimage analysis. Nature methods, pages 1–4, 2023. PubMed PMC
Metrics reloaded: recommendations for image analysis validation