-
A sensitivity analysis to quantify the impact of neuroimaging preprocessing strategies on subsequent statistical analyses
Authors:
Brice Ozenne,
Martin Norgaard,
Cyril Pernet,
Melanie Ganz
Abstract:
Even though novel imaging techniques have been successful in studying brain structure and function, the measured biological signals are often contaminated by multiple sources of noise, arising due to e.g. head movements of the individual being scanned, limited spatial/temporal resolution, or other issues specific to each imaging technology. Data preprocessing (e.g. denoising) is therefore critical…
▽ More
Even though novel imaging techniques have been successful in studying brain structure and function, the measured biological signals are often contaminated by multiple sources of noise, arising due to e.g. head movements of the individual being scanned, limited spatial/temporal resolution, or other issues specific to each imaging technology. Data preprocessing (e.g. denoising) is therefore critical. Preprocessing pipelines have become increasingly complex over the years, but also more flexible, and this flexibility can have a significant impact on the final results and conclusions of a given study. This large parameter space is often referred to as multiverse analyses. Here, we provide conceptual and practical tools for statistical analyses that can aggregate multiple pipeline results along with a new sensitivity analysis testing for hypotheses across pipelines such as "no effect across all pipelines" or "at least one pipeline with no effect". The proposed framework is generic and can be applied to any multiverse scenario, but we illustrate its use based on positron emission tomography data.
△ Less
Submitted 24 April, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
Are demographically invariant models and representations in medical imaging fair?
Authors:
Eike Petersen,
Enzo Ferrante,
Melanie Ganz,
Aasa Feragen
Abstract:
Medical imaging models have been shown to encode information about patient demographics such as age, race, and sex in their latent representation, raising concerns about their potential for discrimination. Here, we ask whether requiring models not to encode demographic attributes is desirable. We point out that marginal and class-conditional representation invariance imply the standard group fairn…
▽ More
Medical imaging models have been shown to encode information about patient demographics such as age, race, and sex in their latent representation, raising concerns about their potential for discrimination. Here, we ask whether requiring models not to encode demographic attributes is desirable. We point out that marginal and class-conditional representation invariance imply the standard group fairness notions of demographic parity and equalized odds, respectively. In addition, however, they require matching the risk distributions, thus potentially equalizing away important group differences. Enforcing the traditional fairness notions directly instead does not entail these strong constraints. Moreover, representationally invariant models may still take demographic attributes into account for deriving predictions, implying unequal treatment - in fact, achieving representation invariance may require doing so. In theory, this can be prevented using counterfactual notions of (individual) fairness or invariance. We caution, however, that properly defining medical image counterfactuals with respect to demographic attributes is fraught with challenges. Finally, we posit that encoding demographic attributes may even be advantageous if it enables learning a task-specific encoding of demographic features that does not rely on social constructs such as 'race' and 'gender.' We conclude that demographically invariant representations are neither necessary nor sufficient for fairness in medical imaging. Models may need to encode demographic attributes, lending further urgency to calls for comprehensive model fairness assessments in terms of predictive performance across diverse patient groups.
△ Less
Submitted 3 July, 2024; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Why is the winner the best?
Authors:
Matthias Eisenmann,
Annika Reinke,
Vivienn Weru,
Minu Dietlinde Tizabi,
Fabian Isensee,
Tim J. Adler,
Sharib Ali,
Vincent Andrearczyk,
Marc Aubreville,
Ujjwal Baid,
Spyridon Bakas,
Niranjan Balu,
Sophia Bano,
Jorge Bernal,
Sebastian Bodenstedt,
Alessandro Casella,
Veronika Cheplygina,
Marie Daum,
Marleen de Bruijne,
Adrien Depeursinge,
Reuben Dorent,
Jan Egger,
David G. Ellis,
Sandy Engelhardt,
Melanie Ganz
, et al. (100 additional authors not shown)
Abstract:
International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To addre…
▽ More
International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.
△ Less
Submitted 30 March, 2023;
originally announced March 2023.
-
On (assessing) the fairness of risk score models
Authors:
Eike Petersen,
Melanie Ganz,
Sune Hannibal Holm,
Aasa Feragen
Abstract:
Recent work on algorithmic fairness has largely focused on the fairness of discrete decisions, or classifications. While such decisions are often based on risk score models, the fairness of the risk models themselves has received considerably less attention. Risk models are of interest for a number of reasons, including the fact that they communicate uncertainty about the potential outcomes to use…
▽ More
Recent work on algorithmic fairness has largely focused on the fairness of discrete decisions, or classifications. While such decisions are often based on risk score models, the fairness of the risk models themselves has received considerably less attention. Risk models are of interest for a number of reasons, including the fact that they communicate uncertainty about the potential outcomes to users, thus representing a way to enable meaningful human oversight. Here, we address fairness desiderata for risk score models. We identify the provision of similar epistemic value to different groups as a key desideratum for risk score fairness. Further, we address how to assess the fairness of risk score models quantitatively, including a discussion of metric choices and meaningful statistical comparisons between groups. In this context, we also introduce a novel calibration error metric that is less sample size-biased than previously proposed metrics, enabling meaningful comparisons between groups of different sizes. We illustrate our methodology - which is widely applicable in many other settings - in two case studies, one in recidivism risk prediction, and one in risk of major depressive disorder (MDD) prediction.
△ Less
Submitted 22 February, 2023; v1 submitted 17 February, 2023;
originally announced February 2023.
-
Biomedical image analysis competitions: The state of current participation practice
Authors:
Matthias Eisenmann,
Annika Reinke,
Vivienn Weru,
Minu Dietlinde Tizabi,
Fabian Isensee,
Tim J. Adler,
Patrick Godau,
Veronika Cheplygina,
Michal Kozubek,
Sharib Ali,
Anubha Gupta,
Jan Kybic,
Alison Noble,
Carlos Ortiz de Solórzano,
Samiksha Pachade,
Caroline Petitjean,
Daniel Sage,
Donglai Wei,
Elizabeth Wilden,
Deepak Alapatt,
Vincent Andrearczyk,
Ujjwal Baid,
Spyridon Bakas,
Niranjan Balu,
Sophia Bano
, et al. (331 additional authors not shown)
Abstract:
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis,…
▽ More
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
△ Less
Submitted 12 September, 2023; v1 submitted 16 December, 2022;
originally announced December 2022.
-
Feature robustness and sex differences in medical imaging: a case study in MRI-based Alzheimer's disease detection
Authors:
Eike Petersen,
Aasa Feragen,
Maria Luise da Costa Zemsch,
Anders Henriksen,
Oskar Eiler Wiese Christensen,
Melanie Ganz
Abstract:
Convolutional neural networks have enabled significant improvements in medical image-based diagnosis. It is, however, increasingly clear that these models are susceptible to performance degradation when facing spurious correlations and dataset shift, leading, e.g., to underperformance on underrepresented patient groups. In this paper, we compare two classification schemes on the ADNI MRI dataset:…
▽ More
Convolutional neural networks have enabled significant improvements in medical image-based diagnosis. It is, however, increasingly clear that these models are susceptible to performance degradation when facing spurious correlations and dataset shift, leading, e.g., to underperformance on underrepresented patient groups. In this paper, we compare two classification schemes on the ADNI MRI dataset: a simple logistic regression model using manually selected volumetric features, and a convolutional neural network trained on 3D MRI data. We assess the robustness of the trained models in the face of varying dataset splits, training set sex composition, and stage of disease. In contrast to earlier work in other imaging modalities, we do not observe a clear pattern of improved model performance for the majority group in the training dataset. Instead, while logistic regression is fully robust to dataset composition, we find that CNN performance is generally improved for both male and female subjects when including more female subjects in the training dataset. We hypothesize that this might be due to inherent differences in the pathology of the two sexes. Moreover, in our analysis, the logistic regression model outperforms the 3D CNN, emphasizing the utility of manual feature specification based on prior knowledge, and the need for more robust automatic feature selection.
△ Less
Submitted 14 July, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
Multi-view consensus CNN for 3D facial landmark placement
Authors:
Rasmus R. Paulsen,
Kristine Aavild Juhl,
Thilde Marie Haspang,
Thomas Hansen,
Melanie Ganz,
Gudmundur Einarsson
Abstract:
The rapid increase in the availability of accurate 3D scanning devices has moved facial recognition and analysis into the 3D domain. 3D facial landmarks are often used as a simple measure of anatomy and it is crucial to have accurate algorithms for automatic landmark placement. The current state-of-the-art approaches have yet to gain from the dramatic increase in performance reported in human pose…
▽ More
The rapid increase in the availability of accurate 3D scanning devices has moved facial recognition and analysis into the 3D domain. 3D facial landmarks are often used as a simple measure of anatomy and it is crucial to have accurate algorithms for automatic landmark placement. The current state-of-the-art approaches have yet to gain from the dramatic increase in performance reported in human pose tracking and 2D facial landmark placement due to the use of deep convolutional neural networks (CNN). Development of deep learning approaches for 3D meshes has given rise to the new subfield called geometric deep learning, where one topic is the adaptation of meshes for the use of deep CNNs. In this work, we demonstrate how methods derived from geometric deep learning, namely multi-view CNNs, can be combined with recent advances in human pose tracking. The method finds 2D landmark estimates and propagates this information to 3D space, where a consensus method determines the accurate 3D face landmark position. We utilise the method on a standard 3D face dataset and show that it outperforms current methods by a large margin. Further, we demonstrate how models trained on 3D range scans can be used to accurately place anatomical landmarks in magnetic resonance images.
△ Less
Submitted 14 October, 2019;
originally announced October 2019.
-
Dining Philosophers, Leader Election and Ring Size problems, in the quantum setting
Authors:
Dorit Aharonov,
Maor Ganz,
Loick Magnin
Abstract:
We provide the first quantum (exact) protocol for the Dining Philosophers problem (DP), a central problem in distributed algorithms. It is well known that the problem cannot be solved exactly in the classical setting. We then use our DP protocol to provide a new quantum protocol for the tightly related problem of exact leader election (LE) on a ring, improving significantly in both time and memory…
▽ More
We provide the first quantum (exact) protocol for the Dining Philosophers problem (DP), a central problem in distributed algorithms. It is well known that the problem cannot be solved exactly in the classical setting. We then use our DP protocol to provide a new quantum protocol for the tightly related problem of exact leader election (LE) on a ring, improving significantly in both time and memory complexity over the known LE protocol by Tani et. al. To do this, we show that in some sense the exact DP and exact LE problems are equivalent; interestingly, in the classical non-exact setting they are not. Hopefully, the results will lead to exact quantum protocols for other important distributed algorithmic questions; in particular, we discuss interesting connections to the ring size problem, as well as to a physically motivated question of breaking symmetry in 1D translationally invariant systems.
△ Less
Submitted 29 July, 2017; v1 submitted 4 July, 2017;
originally announced July 2017.
-
Quantum coin hedging, and a counter measure
Authors:
Maor Ganz,
Or Sattath
Abstract:
A quantum board game is a multi-round protocol between a single quantum player against the quantum board. Molina and Watrous discovered quantum hedging. They gave an example for perfect quantum hedging: a board game with winning probability < 1, such that the player can win with certainty at least 1-out-of-2 quantum board games played in parallel. Here we show that perfect quantum hedging occurs i…
▽ More
A quantum board game is a multi-round protocol between a single quantum player against the quantum board. Molina and Watrous discovered quantum hedging. They gave an example for perfect quantum hedging: a board game with winning probability < 1, such that the player can win with certainty at least 1-out-of-2 quantum board games played in parallel. Here we show that perfect quantum hedging occurs in a cryptographic protocol - quantum coin flipping. For this reason, when cryptographic protocols are composed, hedging may introduce serious challenges into their analysis.
We also show that hedging cannot occur when playing two-outcome board games in sequence. This is done by showing a formula for the value of sequential two-outcome board games, which depends only on the optimal value of a single board game; this formula applies in a more general setting, in which hedging is only a special case.
△ Less
Submitted 16 June, 2017; v1 submitted 10 March, 2017;
originally announced March 2017.
-
Approximate False Positive Rate Control in Selection Frequency for Random Forest
Authors:
Ender Konukoglu,
Melanie Ganz
Abstract:
Random Forest has become one of the most popular tools for feature selection. Its ability to deal with high-dimensional data makes this algorithm especially useful for studies in neuroimaging and bioinformatics. Despite its popularity and wide use, feature selection in Random Forest still lacks a crucial ingredient: false positive rate control. To date there is no efficient, principled and computa…
▽ More
Random Forest has become one of the most popular tools for feature selection. Its ability to deal with high-dimensional data makes this algorithm especially useful for studies in neuroimaging and bioinformatics. Despite its popularity and wide use, feature selection in Random Forest still lacks a crucial ingredient: false positive rate control. To date there is no efficient, principled and computationally light-weight solution to this shortcoming. As a result, researchers using Random Forest for feature selection have to resort to using heuristically set thresholds on feature rankings. This article builds an approximate probabilistic model for the feature selection process in random forest training, which allows us to compute an estimated false positive rate for a given threshold on selection frequency. Hence, it presents a principled way to determine thresholds for the selection of relevant features without any additional computational load. Experimental analysis with synthetic data demonstrates that the proposed approach can limit false positive rates on the order of the desired values and keep false negative rates low. Results show that this holds even in the presence of a complex correlation structure between features. Its good statistical properties and light-weight computational needs make this approach widely applicable to feature selection for a wide-range of applications.
△ Less
Submitted 10 October, 2014;
originally announced October 2014.