Why rankings of biomedical image analysis competitions should be interpreted with care
Language: English; Country: Great Britain, England; Medium: electronic
Document type: journal article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, Non-P.H.S.
Grant support
Wellcome Trust - United Kingdom
MR/P015476/1, Medical Research Council - United Kingdom
R01 EB017230, NIBIB NIH HHS - United States
R01 NS070906, NINDS NIH HHS - United States
PubMed
30523263
PubMed Central
PMC6284017
DOI
10.1038/s41467-018-07619-7
PII: 10.1038/s41467-018-07619-7
Knihovny.cz E-resources
- MeSH
- Biomedical Technology - classification, methods, standards MeSH
- Biomedical Research - methods, standards MeSH
- Diagnostic Imaging - classification, methods, standards MeSH
- Technology Assessment, Biomedical - methods, standards MeSH
- Humans MeSH
- Image Processing, Computer-Assisted - methods, standards MeSH
- Surveys and Questionnaires MeSH
- Reproducibility of Results MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Research Support, N.I.H., Extramural MeSH
- Research Support, U.S. Gov't, Non-P.H.S. MeSH
International challenges have become the de facto standard for validating biomedical image analysis methods. Given their scientific impact, it is surprising that a critical analysis of common practices in challenge organization has not yet been performed. In this paper, we present a comprehensive analysis of biomedical image analysis challenges conducted to date. We demonstrate the importance of challenges and show that the lack of quality control has critical consequences. First, reproducibility and interpretation of the results are often hampered, as typically only a fraction of the relevant information is provided. Second, the rank of an algorithm is generally not robust to a number of variables, such as the test data used for validation, the ranking scheme applied, and the observers who produce the reference annotations. To overcome these problems, we recommend best practice guidelines and define open research questions to be addressed in the future.
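The abstract's claim that rankings are not robust to the ranking scheme can be made concrete with a small sketch (not taken from the paper; the scores below are hypothetical). Two schemes commonly used by challenges — aggregate-then-rank (order algorithms by mean metric value) and rank-then-aggregate (rank algorithms per test case, then average the ranks) — can crown different winners on the very same data:

```python
import numpy as np

# Hypothetical per-case Dice scores: rows = algorithms, columns = test cases.
# Algorithm A excels on most cases but fails badly on one;
# Algorithm B is consistently mediocre.
scores = np.array([
    [0.95, 0.40, 0.92],   # Algorithm A
    [0.80, 0.78, 0.79],   # Algorithm B
])

# Scheme 1: aggregate-then-rank (order algorithms by mean score).
mean_scores = scores.mean(axis=1)             # A: 0.757, B: 0.790
winner_by_mean = int(np.argmax(mean_scores))  # -> 1 (Algorithm B)

# Scheme 2: rank-then-aggregate (rank per case, average the ranks; 1 = best).
ranks = (-scores).argsort(axis=0).argsort(axis=0) + 1
mean_ranks = ranks.mean(axis=1)               # A: 1.33, B: 1.67
winner_by_rank = int(np.argmin(mean_ranks))   # -> 0 (Algorithm A)

print("aggregate-then-rank winner:", "AB"[winner_by_mean])
print("rank-then-aggregate winner:", "AB"[winner_by_rank])
```

In a real challenge, tie handling and bootstrap resampling of the test cases would perturb both orderings further, which is exactly the kind of instability the paper documents.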
Centre for Biomedical Image Analysis Masaryk University 60200 Brno Czech Republic
Centre for Intelligent Machines McGill University Montreal QC H3A0G4 Canada
Complexity Science Hub Vienna 1080 Vienna Austria
Data Science Studio Research Studios Austria FG 1090 Vienna Austria
Department of Computer Science University of Warwick Coventry CV4 7AL UK
Department of Radiation Oncology Massachusetts General Hospital Boston MA 02114 USA
Division of Biostatistics German Cancer Research Center 69120 Heidelberg Germany
Division of Computer Assisted Medical Interventions 69120 Heidelberg Germany
Division of Medical Image Computing 69120 Heidelberg Germany
Electrical Engineering Vanderbilt University Nashville TN 37235-1679 USA
Heidelberg Collaboratory for Image Processing Heidelberg University 69120 Heidelberg Germany
Information System Institute HES SO Sierre 3960 Switzerland
Institute for Surgical Technology and Biomechanics University of Bern Bern 3014 Switzerland
Institute of Biomedical Engineering University of Oxford Oxford OX3 7DQ UK
Institute of Information Systems Engineering TU Wien 1040 Vienna Austria
Institute of Medical Informatics Universität zu Lübeck 23562 Lübeck Germany
Science and Engineering Faculty Queensland University of Technology Brisbane QLD 4001 Australia
Univ Rennes Inserm LTSI UMR_S 1099 Rennes 35043 Cedex France