-
Why is the winner the best?
Authors:
Matthias Eisenmann,
Annika Reinke,
Vivienn Weru,
Minu Dietlinde Tizabi,
Fabian Isensee,
Tim J. Adler,
Sharib Ali,
Vincent Andrearczyk,
Marc Aubreville,
Ujjwal Baid,
Spyridon Bakas,
Niranjan Balu,
Sophia Bano,
Jorge Bernal,
Sebastian Bodenstedt,
Alessandro Casella,
Veronika Cheplygina,
Marie Daum,
Marleen de Bruijne,
Adrien Depeursinge,
Reuben Dorent,
Jan Egger,
David G. Ellis,
Sandy Engelhardt,
Melanie Ganz
, et al. (100 additional authors not shown)
Abstract:
International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To addre…
▽ More
International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.
△ Less
Submitted 30 March, 2023;
originally announced March 2023.
-
Understanding metric-related pitfalls in image analysis validation
Authors:
Annika Reinke,
Minu D. Tizabi,
Michael Baumgartner,
Matthias Eisenmann,
Doreen Heckmann-Nötzel,
A. Emre Kavur,
Tim Rädsch,
Carole H. Sudre,
Laura Acion,
Michela Antonelli,
Tal Arbel,
Spyridon Bakas,
Arriel Benis,
Matthew Blaschko,
Florian Buettner,
M. Jorge Cardoso,
Veronika Cheplygina,
Jianxu Chen,
Evangelia Christodoulou,
Beth A. Cimini,
Gary S. Collins,
Keyvan Farahani,
Luciana Ferrer,
Adrian Galdran,
Bram van Ginneken
, et al. (53 additional authors not shown)
Abstract:
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibilit…
▽ More
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
△ Less
Submitted 23 February, 2024; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Labeling instructions matter in biomedical image analysis
Authors:
Tim Rädsch,
Annika Reinke,
Vivienn Weru,
Minu D. Tizabi,
Nicholas Schreck,
A. Emre Kavur,
Bünyamin Pekdemir,
Tobias Roß,
Annette Kopp-Schneider,
Lena Maier-Hein
Abstract:
Biomedical image analysis algorithm validation depends on high-quality annotation of reference datasets, for which labeling instructions are key. Despite their importance, their optimization remains largely unexplored. Here, we present the first systematic study of labeling instructions and their impact on annotation quality in the field. Through comprehensive examination of professional practice…
▽ More
Biomedical image analysis algorithm validation depends on high-quality annotation of reference datasets, for which labeling instructions are key. Despite their importance, their optimization remains largely unexplored. Here, we present the first systematic study of labeling instructions and their impact on annotation quality in the field. Through comprehensive examination of professional practice and international competitions registered at the MICCAI Society, we uncovered a discrepancy between annotators' needs for labeling instructions and their current quality and availability. Based on an analysis of 14,040 images annotated by 156 annotators from four professional companies and 708 Amazon Mechanical Turk (MTurk) crowdworkers using instructions with different information density levels, we further found that including exemplary images significantly boosts annotation performance compared to text-only descriptions, while solely extending text descriptions does not. Finally, professional annotators constantly outperform MTurk crowdworkers. Our study raises awareness for the need of quality standards in biomedical image analysis labeling instructions.
△ Less
Submitted 20 July, 2022;
originally announced July 2022.
-
Metrics reloaded: Recommendations for image analysis validation
Authors:
Lena Maier-Hein,
Annika Reinke,
Patrick Godau,
Minu D. Tizabi,
Florian Buettner,
Evangelia Christodoulou,
Ben Glocker,
Fabian Isensee,
Jens Kleesiek,
Michal Kozubek,
Mauricio Reyes,
Michael A. Riegler,
Manuel Wiesenfarth,
A. Emre Kavur,
Carole H. Sudre,
Michael Baumgartner,
Matthias Eisenmann,
Doreen Heckmann-Nötzel,
Tim Rädsch,
Laura Acion,
Michela Antonelli,
Tal Arbel,
Spyridon Bakas,
Arriel Benis,
Matthew Blaschko
, et al. (49 additional authors not shown)
Abstract:
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international ex…
▽ More
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
△ Less
Submitted 23 February, 2024; v1 submitted 3 June, 2022;
originally announced June 2022.
-
Common Limitations of Image Processing Metrics: A Picture Story
Authors:
Annika Reinke,
Minu D. Tizabi,
Carole H. Sudre,
Matthias Eisenmann,
Tim Rädsch,
Michael Baumgartner,
Laura Acion,
Michela Antonelli,
Tal Arbel,
Spyridon Bakas,
Peter Bankhead,
Arriel Benis,
Matthew Blaschko,
Florian Buettner,
M. Jorge Cardoso,
Jianxu Chen,
Veronika Cheplygina,
Evangelia Christodoulou,
Beth Cimini,
Gary S. Collins,
Sandy Engelhardt,
Keyvan Farahani,
Luciana Ferrer,
Adrian Galdran,
Bram van Ginneken
, et al. (68 additional authors not shown)
Abstract:
While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using spe…
▽ More
While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. These are typically related to (1) the disregard of inherent metric properties, such as the behaviour in the presence of class imbalance or small target structures, (2) the disregard of inherent data set properties, such as the non-independence of the test cases, and (3) the disregard of the actual biomedical domain interest that the metrics should reflect. This living dynamically document has the purpose to illustrate important limitations of performance metrics commonly applied in the field of image analysis. In this context, it focuses on biomedical image analysis problems that can be phrased as image-level classification, semantic segmentation, instance segmentation, or object detection task. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts from more than 60 institutions worldwide.
△ Less
Submitted 6 December, 2023; v1 submitted 12 April, 2021;
originally announced April 2021.
-
Cross-Domain Segmentation with Adversarial Loss and Covariate Shift for Biomedical Imaging
Authors:
Bora Baydar,
Savas Ozkan,
A. Emre Kavur,
N. Sinem Gezer,
M. Alper Selver,
Gozde Bozdagi Akar
Abstract:
Despite the widespread use of deep learning methods for semantic segmentation of images that are acquired from a single source, clinicians often use multi-domain data for a detailed analysis. For instance, CT and MRI have advantages over each other in terms of imaging quality, artifacts, and output characteristics that lead to differential diagnosis. The capacity of current segmentation techniques…
▽ More
Despite the widespread use of deep learning methods for semantic segmentation of images that are acquired from a single source, clinicians often use multi-domain data for a detailed analysis. For instance, CT and MRI have advantages over each other in terms of imaging quality, artifacts, and output characteristics that lead to differential diagnosis. The capacity of current segmentation techniques is only allow to work for an individual domain due to their differences. However, the models that are capable of working on all modalities are essentially needed for a complete solution. Furthermore, robustness is drastically affected by the number of samples in the training step, especially for deep learning models. Hence, there is a necessity that all available data regardless of data domain should be used for reliable methods. For this purpose, this manuscript aims to implement a novel model that can learn robust representations from cross-domain data by encapsulating distinct and shared patterns from different modalities. Precisely, covariate shift property is retained with structural modification and adversarial loss where sparse and rich representations are obtained. Hence, a single parameter set is used to perform cross-domain segmentation task. The superiority of the proposed method is that no information related to modalities are provided in either training or inference phase. The tests on CT and MRI liver data acquired in routine clinical workflows show that the proposed model outperforms all other baseline with a large margin. Experiments are also conducted on Covid-19 dataset that it consists of CT data where significant intra-class visual differences are observed. Similarly, the proposed method achieves the best performance.
△ Less
Submitted 8 June, 2020;
originally announced June 2020.
-
Basic Ensembles of Vanilla-Style Deep Learning Models Improve Liver Segmentation From CT Images
Authors:
A. Emre Kavur,
Ludmila I. Kuncheva,
M. Alper Selver
Abstract:
Segmentation of the liver from 3D computer tomography (CT) images is one of the most frequently performed operations in medical image analysis. In the past decade, Deep Learning Models (DMs) have offered significant improvements over previous methods for liver segmentation. The success of DMs is usually owed to the user's expertise in deep learning as well as to intricate training procedures. The…
▽ More
Segmentation of the liver from 3D computer tomography (CT) images is one of the most frequently performed operations in medical image analysis. In the past decade, Deep Learning Models (DMs) have offered significant improvements over previous methods for liver segmentation. The success of DMs is usually owed to the user's expertise in deep learning as well as to intricate training procedures. The need for bespoke expertise limits the reproducibility of empirical studies involving DMs. Today's consensus is that an ensemble of DMs works better than the individual component DMs. In this study we set off to explore the potential of ensembles of publicly available, `vanilla-style' DM segmenters Our ensembles were created from four off-the-shelf DMs: U-Net, Deepmedic, V-Net, and Dense V-Networks. To prevent further overfitting and to keep the overall model simple, we use basic non-trainable ensemble combiners: majority vote, average, product and min/max. Our results with two publicly available data sets (CHAOS and 3Dircadb1) demonstrate that ensembles are significantly better than the individual segmenters on four widely used metrics.
△ Less
Submitted 27 January, 2020;
originally announced January 2020.
-
Abdominal multi-organ segmentation with cascaded convolutional and adversarial deep networks
Authors:
Pierre-Henri Conze,
Ali Emre Kavur,
Emilie Cornec-Le Gall,
Naciye Sinem Gezer,
Yannick Le Meur,
M. Alper Selver,
François Rousseau
Abstract:
Objective : Abdominal anatomy segmentation is crucial for numerous applications from computer-assisted diagnosis to image-guided surgery. In this context, we address fully-automated multi-organ segmentation from abdominal CT and MR images using deep learning. Methods: The proposed model extends standard conditional generative adversarial networks. Additionally to the discriminator which enforces t…
▽ More
Objective : Abdominal anatomy segmentation is crucial for numerous applications from computer-assisted diagnosis to image-guided surgery. In this context, we address fully-automated multi-organ segmentation from abdominal CT and MR images using deep learning. Methods: The proposed model extends standard conditional generative adversarial networks. Additionally to the discriminator which enforces the model to create realistic organ delineations, it embeds cascaded partially pre-trained convolutional encoder-decoders as generator. Encoder fine-tuning from a large amount of non-medical images alleviates data scarcity limitations. The network is trained end-to-end to benefit from simultaneous multi-level segmentation refinements using auto-context. Results : Employed for healthy liver, kidneys and spleen segmentation, our pipeline provides promising results by outperforming state-of-the-art encoder-decoder schemes. Followed for the Combined Healthy Abdominal Organ Segmentation (CHAOS) challenge organized in conjunction with the IEEE International Symposium on Biomedical Imaging 2019, it gave us the first rank for three competition categories: liver CT, liver MR and multi-organ MR segmentation. Conclusion : Combining cascaded convolutional and adversarial networks strengthens the ability of deep learning pipelines to automatically delineate multiple abdominal organs, with good generalization capability. Significance : The comprehensive evaluation provided suggests that better guidance could be achieved to help clinicians in abdominal image interpretation and clinical decision making.
△ Less
Submitted 26 January, 2020;
originally announced January 2020.
-
CHAOS Challenge -- Combined (CT-MR) Healthy Abdominal Organ Segmentation
Authors:
A. Emre Kavur,
N. Sinem Gezer,
Mustafa Barış,
Sinem Aslan,
Pierre-Henri Conze,
Vladimir Groza,
Duc Duy Pham,
Soumick Chatterjee,
Philipp Ernst,
Savaş Özkan,
Bora Baydar,
Dmitry Lachinov,
Shuo Han,
Josef Pauli,
Fabian Isensee,
Matthias Perkonigg,
Rachana Sathish,
Ronnie Rajan,
Debdoot Sheet,
Gurbandurdy Dovletov,
Oliver Speck,
Andreas Nürnberger,
Klaus H. Maier-Hein,
Gözde Bozdağı Akar,
Gözde Ünal
, et al. (2 additional authors not shown)
Abstract:
Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) have introduced new state-of-the-art segmentation systems. In order to expand the knowledge on these topics, the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation challenge has been organized in conjunction with IEEE…
▽ More
Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) have introduced new state-of-the-art segmentation systems. In order to expand the knowledge on these topics, the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation challenge has been organized in conjunction with IEEE International Symposium on Biomedical Imaging (ISBI), 2019, in Venice, Italy. CHAOS provides both abdominal CT and MR data from healthy subjects for single and multiple abdominal organ segmentation. Five different but complementary tasks have been designed to analyze the capabilities of current approaches from multiple perspectives. The results are investigated thoroughly, compared with manual annotations and interactive methods. The analysis shows that the performance of DL models for single modality (CT / MR) can show reliable volumetric analysis performance (DICE: 0.98 $\pm$ 0.00 / 0.95 $\pm$ 0.01) but the best MSSD performance remain limited (21.89 $\pm$ 13.94 / 20.85 $\pm$ 10.63 mm). The performances of participating models decrease significantly for cross-modality tasks for the liver (DICE: 0.88 $\pm$ 0.15 MSSD: 36.33 $\pm$ 21.97 mm) and all organs (DICE: 0.85 $\pm$ 0.21 MSSD: 33.17 $\pm$ 38.93 mm). Despite contrary examples on different applications, multi-tasking DL models designed to segment all organs seem to perform worse compared to organ-specific ones (performance drop around 5\%). Besides, such directions of further research for cross-modality segmentation would significantly support real-world clinical applications. Moreover, having more than 1500 participants, another important contribution of the paper is the analysis on shortcomings of challenge organizations such as the effects of multiple submissions and peeking phenomena.
△ Less
Submitted 7 January, 2021; v1 submitted 17 January, 2020;
originally announced January 2020.