Metrics reloaded: recommendations for image analysis validation
Language English Country United States Medium print-electronic
Document type journal articles, reviews
Grant support
UH3 CA225021
NCI NIH HHS - United States
U01 CA242871
NCI NIH HHS - United States
U24 CA279629
NCI NIH HHS - United States
R01 NS042645
NINDS NIH HHS - United States
P41 GM135019
NIGMS NIH HHS - United States
U24 CA215109
NCI NIH HHS - United States
EP-W-17-011
EPA - United States
U24 CA180924
NCI NIH HHS - United States
PubMed
38347141
PubMed Central
PMC11182665
DOI
10.1038/s41592-023-02151-z
PII: 10.1038/s41592-023-02151-z
- MeSH
- Algorithms * MeSH
- Image Processing, Computer-Assisted * MeSH
- Semantics MeSH
- Machine Learning MeSH
- Publication type
- Journal Article MeSH
- Review MeSH
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint: a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
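As a purely illustrative aid, the problem fingerprint described in the abstract could be pictured as a small structured record. The sketch below is not the framework's actual data model or the online tool's API; the class names, field names, dictionary keys and the `missing_aspects` helper are all hypothetical. It only mirrors the four task types and the four aspect groups (domain interest, target structure, dataset, algorithm output) that the abstract names.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemCategory(Enum):
    """The four task types named in the abstract."""
    IMAGE_LEVEL_CLASSIFICATION = "image-level classification"
    OBJECT_DETECTION = "object detection"
    SEMANTIC_SEGMENTATION = "semantic segmentation"
    INSTANCE_SEGMENTATION = "instance segmentation"


@dataclass
class ProblemFingerprint:
    """Hypothetical container for the aspect groups listed in the abstract."""
    category: ProblemCategory
    domain_interest: dict = field(default_factory=dict)
    target_structure: dict = field(default_factory=dict)
    dataset: dict = field(default_factory=dict)
    algorithm_output: dict = field(default_factory=dict)

    def missing_aspects(self):
        """Return the names of aspect groups that have not been filled in yet."""
        groups = {
            "domain_interest": self.domain_interest,
            "target_structure": self.target_structure,
            "dataset": self.dataset,
            "algorithm_output": self.algorithm_output,
        }
        return [name for name, value in groups.items() if not value]


# Example: a partially specified fingerprint (the key below is made up).
fp = ProblemFingerprint(
    category=ProblemCategory.SEMANTIC_SEGMENTATION,
    domain_interest={"boundary_accuracy_matters": True},
)
print(fp.missing_aspects())
```

The point of such a record is only that metric selection starts from an explicit, complete description of the problem; which metrics follow from a given fingerprint is defined by the framework itself and is not reproduced here.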
ARTORG Center for Biomedical Engineering Research University of Bern Bern Switzerland
BCN Medtech Universitat Pompeu Fabra Barcelona Spain
Cell Biology and Biophysics Unit European Molecular Biology Laboratory Heidelberg Germany
Center for Biomedical Image Computing and Analytics University of Pennsylvania Philadelphia PA USA
Center for Scalable Data Analytics and Artificial Intelligence Leipzig University Leipzig Germany
Center for Systems Biology Dresden Germany
Centre for Intelligent Machines and MILA McGill University Montréal Quebec Canada
Centre for Medical Image Computing University College London London UK
Centre for Statistics in Medicine University of Oxford Nuffield Orthopaedic Centre Oxford UK
Department AIBE Friedrich Alexander Universität Erlangen Nürnberg Germany
Department of Biomedical Data Sciences Leiden University Medical Center Leiden the Netherlands
Department of Biomedical Informatics Stony Brook University Health Science Center Stony Brook NY USA
Department of Computer Science IT University of Copenhagen Copenhagen Denmark
Department of Computer Science UiT The Arctic University of Norway Tromsø Norway
Department of Computer Science University of Toronto Toronto Ontario Canada
Department of Computing Faculty of Engineering Imperial College London London UK
Department of Computing Imperial College London South Kensington Campus London UK
Department of Development and Regeneration and EPI centre KU Leuven Leuven Belgium
Department of Digital Medical Technologies Holon Institute of Technology Holon Israel
Department of Informatics Goethe University Frankfurt Frankfurt am Main Germany
Department of Medical Biophysics University of Toronto Toronto Ontario Canada
Department of Medicine Goethe University Frankfurt Frankfurt am Main Germany
Department of Pathology Radboud University Medical Center Nijmegen the Netherlands
Department of Quantitative Biomedicine University of Zurich Zurich Switzerland
Department of Radiation Oncology University Hospital Bern University of Bern Bern Switzerland
Department of Surgery Perelman School of Medicine Philadelphia PA USA
Department of Surgery University Health Network Philadelphia PA USA
Digital Engineering Faculty University of Potsdam Potsdam Germany
Electrical Engineering Vanderbilt University Nashville TN USA
European Federation for Medical Informatics Le Mont sur Lausanne Switzerland
Faculty of Mathematics and Computer Science Heidelberg University Heidelberg Germany
Faculty of Medicine Heidelberg University Hospital Heidelberg Germany
Frankfurt Cancer Institute Frankfurt am Main Germany
Fraunhofer MEVIS Bremen Germany
German Cancer Research Center Heidelberg Division of Biostatistics Heidelberg Germany
German Cancer Research Center Heidelberg Division of Intelligent Medical Systems Heidelberg Germany
German Cancer Research Center Heidelberg Division of Medical Image Computing Heidelberg Germany
German Cancer Research Center Heidelberg Heidelberg Germany
German Cancer Research Center Heidelberg HI Applied Computer Vision Lab Heidelberg Germany
German Cancer Research Center Heidelberg HI Helmholtz Imaging Heidelberg Germany
German Cancer Research Center Heidelberg Interactive Machine Learning Group Heidelberg Germany
Google 1600 Amphitheatre Pkwy Mountain View CA USA
Google Health DeepMind London UK
Google Health Google Palo Alto CA USA
Helmholtz AI Oberschleißheim Germany
IHU Strasbourg Strasbourg France
Imaging Platform Broad Institute of MIT and Harvard Cambridge MA USA
Informatics Institute Faculty of Science University of Amsterdam Amsterdam the Netherlands
Information Systems Institute University of Applied Sciences Western Switzerland Sierre Switzerland
Institute for AI in Medicine University Medicine Essen Essen Germany
Institute for Computational Biomedicine Heidelberg University Heidelberg Germany
Institute of Information Systems Engineering TU Wien Vienna Austria
Instituto de Cálculo CONICET Universidad de Buenos Aires Buenos Aires Argentina
Laboratoire Traitement du Signal et de l'Image UMR_S 1099 Université de Rennes 1 Rennes France
Medical Faculty Heidelberg University Heidelberg Germany
Medical Faculty University of Geneva Geneva Switzerland
National Institutes of Health Clinical Center Bethesda MD USA
Neurocenter Oulu Oulu University Hospital Oulu Finland
Parietal project team INRIA Saclay Île de France Palaiseau France
Physical Sciences Sunnybrook Research Institute Toronto Ontario Canada
Princess Margaret Cancer Centre University Health Network Toronto Ontario Canada
Radboud Institute for Health Sciences Radboud University Medical Center Nijmegen the Netherlands
Research Unit of Health Sciences and Technology Faculty of Medicine University of Oulu Oulu Finland
School of Biomedical Engineering and Imaging Science King's College London London UK
School of Engineering The University of Edinburgh Edinburgh Scotland
Simula Metropolitan Center for Digital Engineering Oslo Norway
Technische Universität Dresden DFG Cluster of Excellence 'Physics of Life' Dresden Germany
Tissue Image Analytics Laboratory Department of Computer Science University of Warwick Coventry UK
Vector Institute for Artificial Intelligence Toronto Ontario Canada