Why rankings of biomedical image analysis competitions should be interpreted with care

Nature Communications. 2018 Dec 06; 9(1): 5217. [Epub 2018 Dec 06]

Language: English. Country: United Kingdom, England. Medium: electronic

Document type: journal article; Research Support, N.I.H., Extramural; grant-supported work; Research Support, U.S. Gov't, Non-P.H.S.

Persistent link: https://www.medvik.cz/link/pmid30523263

Grant support
Wellcome Trust - United Kingdom
MR/P015476/1 Medical Research Council - United Kingdom
R01 EB017230 NIBIB NIH HHS - United States
R01 NS070906 NINDS NIH HHS - United States

Links

PubMed 30523263
PubMed Central PMC6284017
DOI 10.1038/s41467-018-07619-7
PII: 10.1038/s41467-018-07619-7
Knihovny.cz E-resources

International challenges have become the standard for validation of biomedical image analysis methods. Given their scientific impact, it is surprising that a critical analysis of common practices related to the organization of challenges has not yet been performed. In this paper, we present a comprehensive analysis of biomedical image analysis challenges conducted up to now. We demonstrate the importance of challenges and show that the lack of quality control has critical consequences. First, reproducibility and interpretation of the results are often hampered, as only a fraction of the relevant information is typically provided. Second, the rank of an algorithm is generally not robust to a number of variables, such as the test data used for validation, the ranking scheme applied, and the observers who make the reference annotations. To overcome these problems, we recommend best practice guidelines and define open research questions to be addressed in the future.
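As a rough illustration of the second point in the abstract, the sketch below ranks three made-up algorithms on invented per-case scores using two simple ranking schemes, then compares the two rankings with Kendall's tau. The algorithm names, the scores, and the helper functions are all hypothetical; none of them come from the paper or its data, and the code assumes there are no tied scores.

```python
from itertools import combinations
from statistics import mean

# Hypothetical per-case scores (higher is better) for three invented algorithms.
# Algorithm "A" is best on most cases but fails badly on one.
scores = {
    "A": [0.90, 0.90, 0.90, 0.10],
    "B": [0.80, 0.80, 0.80, 0.80],
    "C": [0.85, 0.85, 0.82, 0.75],
}

def aggregate_then_rank(scores):
    """Average each algorithm's scores over all cases, then rank (1 = best)."""
    means = {name: mean(vals) for name, vals in scores.items()}
    ordered = sorted(means, key=means.get, reverse=True)
    return {name: pos + 1 for pos, name in enumerate(ordered)}

def rank_then_aggregate(scores):
    """Rank the algorithms on every case, then rank by mean per-case rank."""
    names = list(scores)
    n_cases = len(next(iter(scores.values())))
    per_case_ranks = {name: [] for name in names}
    for i in range(n_cases):
        ordered = sorted(names, key=lambda n: scores[n][i], reverse=True)
        for pos, name in enumerate(ordered):
            per_case_ranks[name].append(pos + 1)
    mean_rank = {name: mean(r) for name, r in per_case_ranks.items()}
    ordered = sorted(names, key=mean_rank.get)  # lower mean rank is better
    return {name: pos + 1 for pos, name in enumerate(ordered)}

def kendall_tau(r1, r2):
    """Kendall's tau for two rankings of the same items (no ties assumed)."""
    concordant = discordant = 0
    for a, b in combinations(list(r1), 2):
        sign = (r1[a] - r1[b]) * (r2[a] - r2[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(r1) * (len(r1) - 1) // 2
    return (concordant - discordant) / n_pairs

by_mean = aggregate_then_rank(scores)
by_rank = rank_then_aggregate(scores)
tau = kendall_tau(by_mean, by_rank)
print(by_mean, by_rank, round(tau, 3))
```

With these invented numbers, averaging first puts C ahead of B and A, while ranking per case first puts A ahead of C and B, and Kendall's tau between the two rankings is -1/3. The same data therefore yield a different winner depending only on the ranking scheme.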

AIExplore NTUST Center of Computer Vision and Medical Imaging Graduate Institute of Biomedical Engineering National Taiwan University of Science and Technology Taipei 106 Taiwan

Centre for Biomedical Image Analysis Masaryk University 60200 Brno Czech Republic

Centre for Intelligent Machines McGill University Montreal QC H3A0G4 Canada

Centre for Medical Image Computing and Department of Computer Science University College London London W1W 7TS UK

Christian Doppler Laboratory for Ophthalmic Image Analysis Department of Ophthalmology Medical University Vienna 1090 Vienna Austria

CISTIB Center for Computational Imaging and Simulation Technologies in Biomedicine The University of Leeds Leeds Yorkshire LS2 9JT UK

Complexity Science Hub Vienna 1080 Vienna Austria

Data Science Studio Research Studios Austria FG 1090 Vienna Austria

Department of Computer Science University of Warwick Coventry CV4 7AL UK

Department of Electrical and Computer Engineering Department of Computer Science Johns Hopkins University Baltimore MD 21218 USA

Department of Electrical Engineering Eindhoven University of Technology 5600 MB Eindhoven The Netherlands

Department of Radiation Oncology Massachusetts General Hospital Boston MA 02114 USA

Department of Radiology and Nuclear Medicine Medical Image Analysis Radboud University Center 6525 GA Nijmegen The Netherlands

Departments of Radiology Nuclear Medicine and Medical Informatics Erasmus MC 3015 GD Rotterdam The Netherlands

Division of Biostatistics German Cancer Research Center 69120 Heidelberg Germany

Division of Clinical Epidemiology and Aging Research German Cancer Research Center 69120 Heidelberg Germany

Division of Computer Assisted Medical Interventions 69120 Heidelberg Germany

Division of Medical Image Computing 69120 Heidelberg Germany

Division of Translational Surgical Oncology National Center for Tumor Diseases Dresden 01307 Dresden Germany

Electrical Engineering Vanderbilt University Nashville TN 37235-1679 USA

Heidelberg Collaboratory for Image Processing Heidelberg University 69120 Heidelberg Germany

Information System Institute HES SO Sierre 3960 Switzerland

Institute for Advanced Studies Department of Informatics Technical University of Munich 80333 Munich Germany

Institute for Surgical Technology and Biomechanics University of Bern Bern 3014 Switzerland

Institute of Biomedical Engineering University of Oxford Oxford OX3 7DQ UK

Institute of Diagnostic and Interventional Radiology University Medical Center Rostock 18051 Rostock Germany

Institute of Information Systems Engineering TU Wien 1040 Vienna Austria

Institute of Medical Informatics Universität zu Lübeck 23562 Lübeck Germany

Science and Engineering Faculty Queensland University of Technology Brisbane QLD 4001 Australia

Univ Rennes Inserm LTSI UMR_S 1099 Rennes 35043 Cedex France

Erratum in

PubMed


